gbif / data-mobilization

For capturing and discussing potential datasets suitable for publishing to GBIF
Apache License 2.0

Earth microbiome project #163

Open gbif-portal opened 5 years ago

gbif-portal commented 5 years ago

Earth microbiome project

Dataset link: http://www.earthmicrobiome.org/data-and-code/

Region: global

Taxon: bacteria and fungi

Type: metadata

Why is this important: this project was a global attempt to catalog microbes in samples taken all over Earth (27,751 samples from 97 independent studies). Soil, water, animal and plant 'microbiome' samples were studied. Samples were collected and analyzed following a uniform protocol, and identified by 16S rRNA sequencing (bacteria) or by 18S or ITS sequencing (fungi). Currently these data are not easily mapped or queried by non-data scientists.

Priority: low

Bibliographic reference: https://www.nature.com/articles/nature24621

Users contact info: jenrow

sformel-usgs commented 1 year ago

Some additional info:

Dataset link: https://zenodo.org/record/890000#.Y9gvFMnMKUk

Taxon: Bacteria and Archaea

Type: occurrence

License: CC-BY-NC 4.0

Bibliographic reference: https://doi.org/10.1038/nature24621

Comments: We collaborate on other tasks with one of the lead authors of this dataset. We are interested in approaching him to publish this dataset through the GBIF-US node.

Dataholders contact information: Luke Thompson (luke.thompson@noaa.gov)

Users contact info: Steve Formel (sformel@usgs.gov)

sformel-usgs commented 1 year ago

@tobiasgf I'm ready to work on this, but in doing some more investigation, I realized the EMP data is already in MGnify. However, I'm having trouble identifying whether it has already been mobilized to GBIF. Do you know anything about their process, or whom I should ask?
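One possible way to check for existing copies is the public GBIF dataset search API. The sketch below is only illustrative: the search text, the `type` filter and the response fields read in `summarize` (`key`, `title`, `publishingOrganizationTitle`) are assumptions to confirm against the GBIF API documentation, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GBIF registry endpoint for free-text dataset search.
GBIF_DATASET_SEARCH = "https://api.gbif.org/v1/dataset/search"


def build_search_url(query, dataset_type="OCCURRENCE", limit=20):
    """Build a dataset search URL for a free-text query."""
    params = {"q": query, "type": dataset_type, "limit": limit}
    return f"{GBIF_DATASET_SEARCH}?{urlencode(params)}"


def summarize(response):
    """Reduce a parsed search response to (key, title, publisher) tuples.

    The field names are assumed; missing fields come back as None.
    """
    return [
        (d.get("key"), d.get("title"), d.get("publishingOrganizationTitle"))
        for d in response.get("results", [])
    ]


def search_datasets(query):
    """Run the search over the network and return the summary."""
    with urlopen(build_search_url(query)) as resp:
        return summarize(json.load(resp))
```

Calling `search_datasets("Earth Microbiome Project")` would list candidate datasets; whether any of them is the MGnify-mediated copy still has to be checked by hand from the dataset metadata.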

tobiasgf commented 1 year ago

@sformel-usgs That sounds good. Yes, it is an interesting situation with these DNA datasets that get reanalyzed (maybe even by several other infrastructures) and published (again) to GBIF.org. I am unsure whether this specific dataset is already in GBIF through MGnify. MGnify analyses selected datasets, and GBIF mediates the MGnify datasets that fulfil certain criteria (e.g. host-associated datasets are currently excluded). It is possible to identify the provenance of the MGnify datasets in GBIF and trace them back to the original BioProjects and BioSamples, but this could, and likely should, be made more explicit.

GBIF already has a lot of "duplicated records", and that is not a problem as such. We try to connect these when they bear identifiers that allow it. In this particular case, it is actually valuable to "duplicate" datasets already mediated from MGnify, for one particular reason: presently the MGnify datasets can be seen as "taxonomic breakdowns" - they are collapsed/merged on taxon names, and do NOT carry OTUs/ASVs (in the dna-derived extension). This makes them less suitable for combining with other similar datasets (with sequences), and impossible to reanalyse from a sequence perspective.

MiCoDa, a project/database that re-analyses 16S data from the available ENA/SRA archives in a standardised way, also includes the EMP, and that data may well be mediated in GBIF soon too. However, these different routes/pipelines for the same original data may all serve different audiences/users (e.g. a snapshot of the original data pipeline used, taxonomic breakdowns, or data standardised according to some SOP). Ideally GBIF should aim to link these datasets in the future and make the common source evident.
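To make the "taxonomic breakdown" point concrete, here is a minimal sketch with made-up data. The records, sequences and function names are hypothetical; the output keys borrow Darwin Core terms (`eventID`, `scientificName`, `organismQuantity`) and the `DNA_sequence` term of the DNA derived data extension, but the mapping is an assumption for illustration and not how MGnify or GBIF actually build their datasets.

```python
from collections import defaultdict

# Hypothetical ASV-level records: (sample_id, sequence, taxon, read_count)
asv_records = [
    ("sample1", "ACGTACGT", "Pseudomonas", 120),
    ("sample1", "ACGTTCGT", "Pseudomonas", 30),
    ("sample1", "GGCATTAC", "Bacillus", 55),
]


def collapse_by_taxon(records):
    """Merge records on (sample, taxon name); the sequences are dropped."""
    totals = defaultdict(int)
    for sample_id, _sequence, taxon, reads in records:
        totals[(sample_id, taxon)] += reads
    return [
        {"eventID": s, "scientificName": t, "organismQuantity": n}
        for (s, t), n in sorted(totals.items())
    ]


def keep_asvs(records):
    """Emit one occurrence per ASV, carrying the sequence along."""
    return [
        {"eventID": s, "scientificName": t, "organismQuantity": n,
         "DNA_sequence": seq}
        for s, seq, t, n in records
    ]


collapsed = collapse_by_taxon(asv_records)
per_asv = keep_asvs(asv_records)
```

In the collapsed form the two Pseudomonas ASVs become a single record with no sequence attached, so they can no longer be matched against sequences in another dataset. The per-ASV form keeps both.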

sformel-usgs commented 1 year ago

..."presently the MGnify datasets can be seen as "taxonomic breakdowns" - they are collapsed/merged on taxon names, and do NOT carry OTUs/ASVs (in the dna-derived extension)."

This is valuable to know, and I agree that it's worth pursuing the EMP data. I also agree that linking a re-analysis, like MiCoDa, is more sensible than excluding one or another set of analyses. I will keep this in the back of my head while I work on the EMP.