gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

New multivalue field dnaSequenceID #1099

Open thomasstjerne opened 1 month ago

thomasstjerne commented 1 month ago
  1. Create a new searchable multivalue field named dnaSequenceID in the ES index.
  2. When a DNA derived data extension row has data in the field DNA_sequence, populate dnaSequenceID as follows:
    • Uppercase the DNA_sequence.
    • Remove gaps and any other non-IUPAC characters with a regex like /[^ACGTURYSWKMBDHVN]/g.
    • Compute the MD5 hash of the uppercased, cleaned sequence and insert it in dnaSequenceID.
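
The steps above can be sketched as follows. This is only an illustration: the function name `dna_sequence_id` is hypothetical, and encoding the cleaned sequence as ASCII and storing the lowercase hex digest are assumptions that the issue does not spell out.

```python
import hashlib
import re

# Anything outside the IUPAC nucleotide alphabet (gaps, whitespace, digits) is dropped.
NON_IUPAC = re.compile(r"[^ACGTURYSWKMBDHVN]")


def dna_sequence_id(dna_sequence: str) -> str:
    """Return the MD5 hex digest of the uppercased, cleaned sequence."""
    cleaned = NON_IUPAC.sub("", dna_sequence.upper())
    # Assumption: hash the ASCII bytes and keep the 32-character hex form.
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()
```

With this normalization, `dna_sequence_id("acg-t nn")` and `dna_sequence_id("ACGTNN")` give the same ID, so differences in case, gaps, or whitespace do not produce distinct values. A sequence that is empty after cleaning would still hash to the MD5 of the empty string, so a caller might prefer to skip such rows.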

We want this to be a multivalue field because some occurrences may have multiple rows in the DNA derived data extension, each with its own DNA_sequence.
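
As a rough sketch of what "multivalue" means here, the snippet below collects one ID per extension row of a single occurrence. It assumes each row is a dict keyed by field name, and the function name `dna_sequence_ids` is hypothetical. Skipping rows without a sequence and dropping duplicate IDs are also assumptions, not something the issue specifies.

```python
import hashlib
import re


def dna_sequence_ids(extension_rows):
    """Collect one ID per DNA derived data row, keeping first-seen order."""
    ids = []
    for row in extension_rows:
        sequence = row.get("DNA_sequence")
        if not sequence:
            continue  # rows without a sequence contribute nothing
        cleaned = re.sub(r"[^ACGTURYSWKMBDHVN]", "", sequence.upper())
        seq_id = hashlib.md5(cleaned.encode("ascii")).hexdigest()
        if seq_id not in ids:  # assumption: identical sequences are stored once
            ids.append(seq_id)
    return ids
```

The resulting list would be the value indexed under dnaSequenceID for that occurrence.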

tobiasgf commented 2 weeks ago

I suggest using the broader and less ambiguous name dnaSequenceID. It also indicates the connection to the field (DNA_sequence) from which the value is derived.

Reasoning: "ASV" is, strictly speaking, a DNA sequence resulting from only some particular sequencing and bioinformatic processing pipelines, not all. BOLD sequences, for example, are for the most part Sanger sequences (not ASVs). Currently we do not identify and separate DNA sequences of different "types" or from different sources (environmental DNA, specimens, etc.) in order to handle them differently. Thus, I think we need a more accommodating term than asvID.

thomasstjerne commented 2 weeks ago

I have updated this issue according to @tobiasgf's comment.