Update script to generate darwin core archive for molecular data following some GBIF recommendations

mskyttner commented 5 years ago

https://data-blog.gbif.org/post/gbif-molecular-data/ suggests a few options for how to include sequence data in a darwin core archive when publishing to GBIF. Even though Living Atlases currently don't provide exactly the same implementation as GBIF when indexing, we should try to follow the recommendations.

Suggestion:

In addition to what is already there, add another field in occurrence_core using the term https://dwc.tdwg.org/terms/#associatedSequences, which allows a pipe-separated list of identifiers (publication, global unique identifier, URI) of genetic sequence information associated with the Occurrence, if such identifiers are publicly available for the sequences.
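A minimal sketch of how such a pipe-separated value could be built, assuming the identifiers are already available as a list of strings. The function name and the example.org URLs are hypothetical placeholders, not part of the existing script; the " | " separator follows the Darwin Core recommendation for list values.

```python
def join_associated_sequences(identifiers):
    """Join sequence identifiers into one pipe-separated string.

    Empty or missing entries are skipped, and surrounding whitespace
    is trimmed. Returns an empty string when nothing usable is left.
    """
    cleaned = [i.strip() for i in identifiers if i and i.strip()]
    return " | ".join(cleaned)


# Hypothetical identifiers, for illustration only.
value = join_associated_sequences(
    ["https://example.org/seq/A1", " https://example.org/seq/B2 ", "", None]
)
print(value)
```

An occurrence with no public identifiers would simply get an empty associatedSequences value.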

Also add, in addition to what is already there, another field in occurrence_core using the term http://rs.tdwg.org/dwc/terms/#dwc:dynamicProperties, with JSON-encoded values for the relevant fields used in the GGBN extension, similar to:

{"primerSequenceForward":"CCTACGGGNGGCWGCAG","primerSequenceReverse":"GACTACHVGGGTATCTAATCC","primerNameForward":"341F","primerNameReverse":"805R","barcodeSequence":"TCTTTCACCAGGGACGAAGCGCAAGTGACGGTACCTGGAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAATCTCACAGCTCAACTGTGGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA"}
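As a rough sketch, a value like the one above could be produced from one record of the ggbn extension as follows. This assumes each record is available as a dict keyed by field name; the function name, the GGBN_FIELDS list and the sample record are illustrative assumptions, not the actual script.

```python
import json

# Field names taken from the example above; the real extension may hold more.
GGBN_FIELDS = [
    "primerSequenceForward",
    "primerSequenceReverse",
    "primerNameForward",
    "primerNameReverse",
    "barcodeSequence",
]


def to_dynamic_properties(record):
    """Serialize the non-empty GGBN fields of one record as compact JSON."""
    props = {key: record[key] for key in GGBN_FIELDS if record.get(key)}
    # Compact separators give the same whitespace-free style as the example.
    return json.dumps(props, separators=(",", ":"))


# Hypothetical record; "eventID" is ignored because it is not a GGBN field.
record = {
    "eventID": "sample-1",
    "primerSequenceForward": "CCTACGGGNGGCWGCAG",
    "primerNameForward": "341F",
    "primerNameReverse": "",
}
print(to_dynamic_properties(record))
```

Dropping empty fields keeps the string short, at the cost of records having different key sets.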

pragermh commented 5 years ago

We can include associatedSequences, when available, but also need to store the actual sequences (and primer sequences) so that users can search directly for these in BAS. So, yes, I think we should try the dwc:dynamicProperties alternative next.

mskyttner commented 5 years ago

@pragermh I noticed that you closed this issue, but I cannot see that you have added these terms to the script yet. Can we get these into the dwca-export?

pragermh commented 5 years ago

Oops, sorry. New indata and derived zip in folder using-existing-taxa.

mskyttner commented 5 years ago

@pragermh in 44159cb1 I updated the script to add a column containing a JSON string built from all of the fields in the ggbn extension. The JSON string is in a format that can easily be deserialized into tabular form, which an external application could use, for example, to support searches on these strings.
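To illustrate what an external application might do with that column, here is a minimal sketch that turns a list of such JSON strings into a simple table and searches one column for a sequence fragment. The function names and sample values are hypothetical; it only assumes each value is a flat JSON object or an empty string.

```python
import json


def dynamic_properties_to_table(json_strings):
    """Parse JSON strings into (columns, rows), one row per input string."""
    records = [json.loads(s) if s else {} for s in json_strings]
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:  # keep first-seen column order
                columns.append(key)
    rows = [[rec.get(col, "") for col in columns] for rec in records]
    return columns, rows


def search_column(columns, rows, column, fragment):
    """Return indices of rows whose value in `column` contains `fragment`."""
    idx = columns.index(column)
    return [i for i, row in enumerate(rows) if fragment in row[idx]]


# Hypothetical column values, for illustration only.
values = [
    '{"primerNameForward":"341F","barcodeSequence":"ACGT"}',
    '{"primerNameReverse":"805R"}',
    "",
]
columns, rows = dynamic_properties_to_table(values)
print(columns)
print(search_column(columns, rows, "barcodeSequence", "CG"))
```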

mskyttner commented 5 years ago

@pragermh before closing this I'd like to double-check with you whether you will have any data for the associatedSequences term, i.e. external identifiers (publication, global unique identifier, URI) of genetic sequence information associated with the Occurrence?

pragermh commented 5 years ago

For prokaryote data, we will typically lack links to (denoised) sequences, but I should probably include the associatedSequences term in the future, as it may be useful for other organism groups. I have already included materialSampleID, which should hold links to e.g. ENA samples. They will, in turn, point to raw sequence reads, but unfortunately there is no connection between specific reads and a certain occurrence.

mskyttner commented 5 years ago

It sounds like you have all the terms you currently need in the script and data. The script can also generate the JSON for dynamicProperties, and it follows the recommendations from GBIF, so let's close this issue.

bioatlas / data-mobilization-pipeline

Update script to generate darwin core archive for molecular data following some GBIF recommendations #2