gbif / data-mobilization

For capturing and discussing potential datasets suitable for publishing to GBIF
Apache License 2.0

Global spores dataset #455

Open gbif-portal opened 4 months ago

gbif-portal commented 4 months ago

Global spores dataset

Dataset link: https://zenodo.org/records/10896659

Region: Global

Taxon: Fungi

Type: sampling event

Priority: medium

License: CC-BY

Bibliographic reference: https://www.nature.com/articles/s41586-024-07658-9

Dataholders contact information: corresponding authors?

Users contact info: dschigel

spalp commented 4 months ago

The occurrence data were originally published by Ovaskainen et al., "Data from: Global Spore Sampling Project: A global standardized dataset of airborne fungal DNA", https://doi.org/10.5281/zenodo.10435615 (2024), under CC-BY. I would thus rather contact Ovaskainen...

Or, since the dataset has also been deposited at the European Nucleotide Archive (ENA, https://identifiers.org/ena.embl:PRJEB65748), perhaps it is already incorporated here: https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2#description? I tried filtering the INSDC Sequences by the project number, but this did not return any results.

How else could one check whether this dataset was incorporated? Should we contact the publisher? @CecSve @dagendresen

ManonGros commented 3 months ago

I couldn't find the records mentioned in ENA. Maybe I am just not looking in the right place, or maybe they were excluded. Even if they were included, all the sequences have the species name "air metagenome" in ENA, which wouldn't be very helpful on GBIF.

CecSve commented 3 months ago

I couldn't find the records mentioned in ENA. Maybe I am just not looking in the right place, or maybe they were excluded. Even if they were included, all the sequences have the species name "air metagenome" in ENA, which wouldn't be very helpful on GBIF.

I think they also have scientific names, based on the description in ENA:

Each OTU is accompanied by a probabilistic taxonomic classification, validated through comparison with expert evaluations.

It might be worthwhile contacting Ovaskainen.

tobiasgf commented 3 months ago

They have likely only submitted the raw data (fastq files) to ENA. They need to be (re-)analysed to get any meaningful taxonomy. People usually do not share the inferred/denoised sequences (often referred to as ASVs or OTUs) as individual records in "GenBank", and this is also not recommended.

The Zenodo archive linked above (https://zenodo.org/records/10896659) mentions that the dataset itself was published in another archive (Ovaskainen et al. Data from: Global Spore Sampling Project: A global standardized dataset of airborne fungal DNA. https://doi.org/10.5281/zenodo.10435615 (2024)). That is this one: https://zenodo.org/records/11125610.

That archive seems to include files (otu.gz, otu_table.rds, otu_taxonomy.rds, read_counts.tsv) that with a bit of formatting would be suitable to parse with the GBIF eDNA metabarcoding tool.

One would need to download the data and see how it can be fitted to one of the available templates.
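As a first look at how a file might fit a template, something like the following could report its dimensions and leading column names. This is only a minimal sketch: `summarize_table` is a hypothetical helper, and the assumption that a file such as read_counts.tsv is tab-delimited with a single header row has not been checked against the archive.

```python
import csv


def summarize_table(path, delimiter="\t", preview=5):
    """Summarize a delimited text file without loading it into memory.

    Assumes (unverified for the Zenodo files) one header row followed by
    data rows. Returns the column count, the data row count and the
    first few column names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])          # empty file -> empty header
        n_rows = sum(1 for _ in reader)    # stream the remaining rows
    return {
        "n_columns": len(header),
        "n_rows": n_rows,
        "first_columns": header[:preview],
    }
```

Calling `summarize_table("read_counts.tsv")` on a downloaded copy would show whether samples or OTUs run along the columns, which is the main thing to know when choosing a template shape.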

Let me know if I can be of help with the formatting. It is probably best done in collaboration with the authors.

spalp commented 3 months ago

Thank you all for the clarifications. I will contact Ovaskainen and let you know how it went.

@tobiasgf Thank you for offering help with the formatting. I have never worked with DNA data, but what I can do is go through the explanations available in the User Guide for the GBIF DNA metabarcoding data converter (in prep.). And of course, if Ovaskainen agrees to publish on GBIF and we need assistance, I will gladly take you up on the offer.

Btw, trying to access the GBIF eDNA metabarcoding tool, I get the following message: "You don't have permission to access this resource." Should I request permission, and from whom?

tobiasgf commented 3 months ago

Super.

The tool is still a prototype, but fully functional for making a DwC-A from an OTU table in one of the four template shapes. The guide is not finished, but should contain enough information to let you get the idea of how to use it. I can access the website/link with no problems. Is it the login that gives you problems? You need to use your UAT login.

CecSve commented 3 months ago

@spalp we are in the process of figuring out how to ensure all contractors are properly trained in the tool when it is more mature, so please use the DNA-derived guide for now. Please ping @tobiasgf or helpdesk if you require further help and thank you for taking the lead on this one!

CecSve commented 3 months ago

Btw - since the first author is based in Finland and the last author is based in Sweden, the node(s) should be involved if the authors are keen on mobilizing the data for GBIF.

tobiasgf commented 3 months ago

I just met Nerea Abrego at the International Mycological Congress. When I approached her about the dataset and the possibility of getting it into GBIF, her response was: "It is already published and in public domain, you can just grab it." We agreed that we would take a look at the data, see how much work it would take to get it prepared as a DwC-Archive, and then reach out to them. She confirmed that the dataset is massive and likely not something to juggle around in Excel or similar. Let me know if somebody is already exploring this. If not, I will take a look at it with some help (from @thomasstjerne), to explore whether we can use the prototype eDNA tool or some existing scripts for data-wrangling of eDNA data to get this in shape.

CecSve commented 3 months ago

Please go ahead @tobiasgf and @thomasstjerne. Please notify the Finnish and Swedish nodes before you begin - the data should probably be officially published by one of them.

thomasstjerne commented 1 week ago

@wkmor1 I ran this through the UAT MDT as follows:

  1. Download the files otu.table.csv, metadata.csv and taxonomy.csv from Zenodo
  2. Rename the first header in taxonomy.csv from OTU to id
  3. Rename the first header in metadata.csv from sample.id to id
  4. Upload to the MDT and go through mapping, metadata etc. as usual. (Be patient: the metrics took maybe 30 minutes to complete.)
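
The two renaming steps (2 and 3) could also be scripted instead of edited by hand, which may help given the size of the files. A minimal sketch, assuming the files are comma-separated with one header row; `rename_first_header` is a hypothetical helper, not part of the MDT:

```python
import csv


def rename_first_header(src, dst, expected, new="id"):
    """Copy a CSV from src to dst, renaming its first column header.

    Raises ValueError if the first header is not `expected`, so a wrong
    file or an already-renamed file is not silently rewritten. Rows are
    streamed one at a time, so large tables do not need to fit in memory.
    """
    # "utf-8-sig" drops a byte-order mark if one is present.
    with open(src, newline="", encoding="utf-8-sig") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if not header or header[0] != expected:
            raise ValueError(
                f"{src}: expected first header {expected!r}, found {header!r}"
            )
        header[0] = new
        writer.writerow(header)
        writer.writerows(reader)  # copy the data rows unchanged
```

For the steps above, the calls would be `rename_first_header("taxonomy.csv", "taxonomy_id.csv", expected="OTU")` and `rename_first_header("metadata.csv", "metadata_id.csv", expected="sample.id")`; the output file names are placeholders. Note that the `csv` writer may normalize quoting and line endings compared with the original files.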

Here is the dataset in UAT: https://www.gbif-uat.org/dataset/00c6665b-4826-4cee-8419-54ee0c359f27

wkmor1 commented 1 week ago

@thomasstjerne Great! I will see if I can replicate in our installation.