Open gbif-portal opened 5 years ago
Some additional info:
Dataset link: https://zenodo.org/record/890000#.Y9gvFMnMKUk
Taxon: Bacteria and Archaea
Type: occurrence
License: CC-BY-NC 4.0
Bibliographic reference: https://doi.org/10.1038/nature24621
Comments: We collaborate on other tasks with one of the lead authors of this dataset. We are interested in approaching him to publish this dataset through the GBIF-US node.
Dataholders contact information: Luke Thompson (luke.thompson@noaa.gov)
Users contact info: Steve Formel (sformel@usgs.gov)
@tobiasgf I'm ready to work on this, but in doing some more investigation, I realized the EMP data is already in MGnify. However, I'm having trouble determining whether it has already been mobilized to GBIF. Do you know anything about their process, or whom I should ask?
@sformel-usgs That sounds good. Yes, it is an interesting situation with these DNA datasets that get reanalyzed, maybe even by several other infrastructures, and published (again) to GBIF.org. I am unsure whether this specific dataset is already in GBIF through MGnify. MGnify analyses selected datasets, and GBIF mediates the MGnify datasets that fulfil certain criteria (presently excluding host-associated datasets, for example). It is possible to identify the provenance of the MGnify datasets in GBIF and trace them back to the original BioProjects and BioSamples, but this could, and likely should, be made more explicit. GBIF already has a lot of "duplicated records", and that is not a problem as such. We try to connect these when they bear identifiers that allow it. In this particular case, it is actually valuable to "duplicate" datasets already mediated from MGnify, for one particular reason: presently the MGnify datasets can be seen as "taxonomic breakdowns" - they are collapsed/merged on taxon names, and do NOT carry OTUs/ASVs (in the dna-derived extension). This makes them less suitable for combining with other similar datasets (with sequences), and impossible to reanalyse from a sequence perspective. MiCoDa, a project/database that re-analyses 16S data from available ENA/SRA archives in a standardised way, also includes the EMP, and that data may well be mediated in GBIF soon too. However, these different routes/pipelines for the same original data may all serve different audiences/users (e.g. a snapshot of the original data pipeline used, taxonomic breakdowns, or data standardised according to some SOP). Ideally, GBIF should aim to link these datasets in the future and make the common source evident.
..."presently the MGnify datasets can be seen as "taxonomic breakdowns" - they are collapsed/merged on taxon names, and do NOT carry OTUs/ASVs (in the dna-derived extension)."
This is valuable to know, and I agree that it's worth pursuing the EMP data. I also agree that linking a re-analysis, like MiCoDa, is more sensible than excluding one or another set of analyses. I will keep this in the back of my head while I work on the EMP.
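To make the "taxonomic breakdown" point concrete, here is a minimal Python sketch of what collapsing on taxon names does to ASV-level records. The record layout, the function name `collapse_on_taxon`, and the sequences and read counts are all made up for illustration; they are not EMP or MGnify data, and this is not how MGnify or GBIF actually process datasets.

```python
from collections import defaultdict

# Hypothetical ASV-level records for one sample: (sequence, taxon name, read count).
# Sequences and counts are invented for illustration only.
asv_records = [
    ("ACGTACGTAA", "Pseudomonas", 120),
    ("ACGTACGTAC", "Pseudomonas", 30),
    ("TTGCATGCAA", "Bacillus", 75),
]


def collapse_on_taxon(records):
    """Merge ASV-level records on taxon name, summing read counts.

    The sequence of each ASV is dropped in the process, which is why the
    result cannot be re-analysed from a sequence perspective.
    """
    totals = defaultdict(int)
    for _sequence, taxon, reads in records:
        totals[taxon] += reads
    return dict(totals)


breakdown = collapse_on_taxon(asv_records)
print(breakdown)  # {'Pseudomonas': 150, 'Bacillus': 75}
```

The two distinct Pseudomonas ASVs end up as a single row, so a dataset published in this form can no longer be matched by sequence against other ASV-level datasets.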
Earth Microbiome Project
Dataset link: http://www.earthmicrobiome.org/data-and-code/
Region: global
Taxon: bacteria and fungi
Type: metadata
Why is this important: This project was a global attempt to catalog microbes in samples taken from all over the Earth (27,751 samples from 97 independent studies). Soil, water, animal, and plant 'microbiome' samples were studied. Samples were collected and analyzed following a uniform protocol, and were identified by 16S rRNA sequencing (bacteria) or by 18S or ITS sequencing (fungi). Currently these data are not easily mapped or queried by non-data scientists.
Priority: low
Bibliographic reference: https://www.nature.com/articles/nature24621
Users contact info: jenrow