gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Incorrect endpoint for Geographically tagged INSDC sequences #644

Closed rdmpage closed 5 years ago

rdmpage commented 7 years ago

The dataset "Geographically tagged INSDC sequences" https://www.gbif.org/dataset/ad43e954-dd79-4986-ae34-9ccdbd8bf568 hasn't been crawled successfully since 26 August 2014. Browsing the EBI FTP site, I found a Darwin Core Archive at http://ftp.ebi.ac.uk/pub/databases/ena/biodiversity/occurrences/occurrences.tar.gz and it has been updated recently (I'm guessing the EBI has a series of scripts that update this archive). If the dataset's endpoint is changed to point at this archive, a lot of geotagged DNA sequences should become available in GBIF.
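To make the check concrete, here is a minimal sketch of how one might compare the endpoint registered for the dataset with the archive URL above. It assumes the GBIF registry API serves the dataset record as JSON at `https://api.gbif.org/v1/dataset/{key}` with an `endpoints` list whose entries carry a `url` field; the helper names are made up for illustration and are not part of any GBIF client.

```python
import json
from urllib.request import urlopen

DATASET_KEY = "ad43e954-dd79-4986-ae34-9ccdbd8bf568"
EXPECTED_URL = (
    "http://ftp.ebi.ac.uk/pub/databases/ena/biodiversity/"
    "occurrences/occurrences.tar.gz"
)


def fetch_dataset(key):
    """Fetch the registry record for a dataset (needs network access).

    Assumption: the registry exposes the record at /v1/dataset/{key}.
    """
    with urlopen(f"https://api.gbif.org/v1/dataset/{key}") as resp:
        return json.load(resp)


def endpoint_urls(dataset):
    """Return the URLs of all endpoints listed in a registry record.

    Assumption: endpoints live under an "endpoints" key, each with a "url".
    """
    return [ep["url"] for ep in dataset.get("endpoints", []) if "url" in ep]


def endpoint_is_current(dataset, expected_url):
    """True if the expected archive URL is among the registered endpoints."""
    return expected_url in endpoint_urls(dataset)
```

With network access, `endpoint_is_current(fetch_dataset(DATASET_KEY), EXPECTED_URL)` would report whether the registered endpoint already matches the archive on the FTP site.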

dschigel commented 7 years ago

Thanks, Rod, well spotted. Our earlier attempts to fix this through the publisher were not successful (not even through Science Committee membership :). We can try again. Even if the dataset is back to crawling, there are mapping issues: one sequence should not correspond to one occurrence if the data come from next-gen sequencing. The dataset contains the infamous case of https://www.gbif.org/occurrence/search?dataset_key=ad43e954-dd79-4986-ae34-9ccdbd8bf568&taxon_key=7261875, where 0.5M "occurrences" are sequences from a single bat individual from a cave in Siberia. My approach would be to kill & remap, not to revive & recrawl, as the latter would preserve the data quality issues from the past. I am counting on your new export & mapping! Ideally, we would like to have regular data exports from and by INSDC (with attribution, just like in the BOLD case), but I see that as a second priority compared to fixing data lags. Nota bene @ahahn-gbif @kbraak @jlegind

rdmpage commented 7 years ago

@dschigel Sure, but in the meantime we could simply edit the endpoint for the EMBL data so that it is correct. The EBI FTP site has some other folders relating to event and sample data, so it looks like someone is experimenting with other approaches. The work I’m doing is focussing on standard sequences; perhaps the nextgen stuff should be treated separately because of the issue you raise about the bat sequence (among other problems). But we have the same issue with regular sequences: some will be from the same specimen, hence 1 sequence = 1 occurrence won’t always be true. The question is how far to go in trying to cluster sequences into groups derived from the same sample, and whether there is enough data in INSDC to enable us to do that.
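As a rough illustration of the clustering question, the sketch below groups sequence records by a tuple of sample-level fields, so that each group (rather than each sequence) could become one occurrence. The choice of `specimen_voucher`, `collection_date` and `lat_lon` as the grouping key is an assumption for illustration only, as are the record layout and the function name; whether INSDC records carry enough of this metadata is exactly the open question.

```python
from collections import defaultdict

# Assumed grouping key: which fields really identify a sample is an open question.
SAMPLE_FIELDS = ("specimen_voucher", "collection_date", "lat_lon")


def cluster_by_sample(records, fields=SAMPLE_FIELDS):
    """Group accession numbers by the values of the given sample fields.

    Each record is a dict with an "accession" key plus optional metadata.
    Records with none of the fields populated stay in their own group,
    because there is nothing to justify merging them.
    """
    clusters = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in fields)
        if all(value is None for value in key):
            key = ("unclustered", rec["accession"])
        clusters[key].append(rec["accession"])
    return dict(clusters)


# Hypothetical records: two sequences from one voucher, one with no metadata.
records = [
    {"accession": "AB000001", "specimen_voucher": "MUS 123",
     "collection_date": "2010-05-01", "lat_lon": "55.0 N 83.0 E"},
    {"accession": "AB000002", "specimen_voucher": "MUS 123",
     "collection_date": "2010-05-01", "lat_lon": "55.0 N 83.0 E"},
    {"accession": "AB000003"},
]
clusters = cluster_by_sample(records)
```

Exact matching on these fields is the simplest possible rule; it would miss sequences whose metadata are formatted inconsistently, and it could wrongly merge distinct specimens that share a date and locality but lack a voucher.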
