gbif / doc-publishing-dna-derived-data

This guide shows how to publish DNA-derived spatiotemporal biodiversity data and make it discoverable through national and global biodiversity data discovery platforms. Based on experiences from Australia, Norway, Sweden, UNITE, and GBIF.
https://doi.org/10.35035/doc-vf1a-nr22

clarify how to use 'associatedSequences' #199

Open sformel-usgs opened 7 months ago

sformel-usgs commented 7 months ago

A conversation with some other folks made me realize that in the gomecc data [1][2], we used associatedSequences to provide bare NCBI identifiers (e.g. PRJNA887898) without any indication of the namespace. Including the namespace, e.g. "https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887898/", would make it easy for non-experts (and future experts) to know what the identifier refers to. This was probably due to bad advice from me, because when I see PRJNA, SAMN, SRR, etc., I think NCBI. If you look at the Darwin Core examples, they include the NCBI namespace in the URI.

Because of this faux pas, I'm wondering if this could be made more explicit in the DNA publishing guides. The OBIS manual describes it as:

associatedSequences should contain a link to the “raw” sequences deposited in a public database or list of identifiers for the genetic sequence associated with the occurrence record (e.g. GenBank). The actual sequence of the occurrence will be documented in the DNA Derived Data extension.

and then gives an example of NCBI BioProject accession number PRJNA433203 in a table further on.

The GBIF guide more or less uses the DwC definition, which includes the namespace example:

A list (concatenated and separated) of identifiers (publication, global unique identifier, URI) of genetic sequence information associated with the Occurrence. Could be used for linking to archived raw barcode reads and/or associated genome sequences, e.g. in a public repository.

I don't think these are bad definitions, but the way 'identifier' is used leaves open to interpretation what counts as an identifier. I'm not losing sleep over this one, but I'm curious if anyone else has an opinion on it? @tobiasgf @pieterprovoost @saarasuominen @ksil91?

pieterprovoost commented 7 months ago

@sformel-usgs @EliLawrence Good point, we'll update the OBIS manual. I think the example in the guide is pretty clear as it is.

tobiasgf commented 7 months ago

Good point. I'll make sure to make this more precise in our guidance material, including "the guide".

EliLawrence commented 1 month ago

To clarify what you're suggesting @sformel-usgs, are you recommending that associatedSequences should include reference to the namespace, or that it does not have to? All the examples for associatedSequences you linked to are in the https://www.ncbi.... kind of format. Perhaps I am not entirely clear on what you mean by namespace in this context...

So if we updated the OBIS Manual documentation to

associatedSequences should contain a link, identifier, or list (concatenated and separated) of identifiers of genetic sequence information associated with the Occurrence. Can be used for linking to archived raw barcode reads and/or associated genome sequences, like in a public repository. Example: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887898/. The actual sequence of the occurrence will be documented in the DNA Derived Data extension.

would this be sufficient?

sformel-usgs commented 1 month ago

@EliLawrence Yes, associatedSequences should include a reference to the URL domain (what I'm calling the namespace; apologies if that isn't quite the correct term) to give meaning to the identifier. It's analogous to AphiaID vs LSID from WoRMS. Does that make sense?
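
For datasets that already hold bare accessions, the expansion described here could be sketched roughly as follows. This is only an illustration: the function name `expand_accession` is hypothetical, the BioProject URL pattern is the one quoted in this thread, and the BioSample and SRA patterns are assumptions that should be checked against NCBI before use.

```python
import re

# Prefix-to-URL mapping. Only the BioProject pattern appears in this thread;
# the BioSample and SRA patterns are assumptions for illustration.
NCBI_URL_PATTERNS = [
    (re.compile(r"^PRJNA\d+$"), "https://www.ncbi.nlm.nih.gov/bioproject/{acc}/"),
    (re.compile(r"^SAMN\d+$"), "https://www.ncbi.nlm.nih.gov/biosample/{acc}/"),
    (re.compile(r"^SRR\d+$"), "https://www.ncbi.nlm.nih.gov/sra/{acc}/"),
]


def expand_accession(value):
    """Return a namespaced URL for a bare NCBI accession.

    Values that already look like URLs, or that match no known prefix,
    are returned unchanged (apart from whitespace trimming).
    """
    value = value.strip()
    if value.lower().startswith(("http://", "https://")):
        return value
    for pattern, template in NCBI_URL_PATTERNS:
        if pattern.match(value):
            return template.format(acc=value)
    return value


print(expand_accession("PRJNA887898"))
```

Unrecognized values are deliberately left untouched rather than guessed at, so anything the mapping does not cover can still be reviewed by hand.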

EliLawrence commented 1 month ago

Yes that makes sense, thanks Steve! I've updated the OBIS Manual guidelines for this accordingly:

associatedSequences should contain a reference to the URL domain where genetic sequence information associated with the Occurrence can be found, e.g. a link, identifier, or list (concatenated and separated) of identifiers. Can link to archived raw barcode reads and/or associated genome sequences, like a public repository. It is recommended that links contain the domain name (e.g. NCBI) in the URL, for example: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887898/. The actual sequence of the occurrence will be documented in the DNA Derived Data extension.
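
A quick check along these lines could flag associatedSequences values that still contain identifiers without a domain. This is a minimal sketch, not part of either manual: the function names are hypothetical, and it assumes the list is separated with a vertical bar, which is the usual Darwin Core convention for concatenated values.

```python
from urllib.parse import urlparse


def split_associated_sequences(value, sep="|"):
    """Split a concatenated associatedSequences value into trimmed entries."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def bare_identifiers(value, sep="|"):
    """Return the entries that are not http(s) URLs with a domain."""
    bare = []
    for entry in split_associated_sequences(value, sep):
        parsed = urlparse(entry)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            bare.append(entry)
    return bare


value = "https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887898/ | PRJNA433203"
print(bare_identifiers(value))
```

An empty result means every entry carries its domain; anything returned is a candidate for rewriting as a full URL.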

tobiasgf commented 1 month ago

Opening again to keep it on my to-do list for the GBIF docs.