GLOMICON / asvBiomXchange

A repository to develop an exchange format for molecular biodiversity data

Method-agnostic naming scheme for ASVs: MD5 checksums #14

Open cuttlefishh opened 5 years ago

cuttlefishh commented 5 years ago

One of the advantages of using exact-sequence ASVs is that the sequence is the OTU. Multiple BIOM tables can be merged because the ASV IDs do not depend on some external OTU database. In fact, in Deblur, the FASTA headers for the representative sequences were originally the sequences themselves, but full-length sequences are cumbersome as identifiers. DADA2 has a different solution, where the FASTA headers for the representative sequences are MD5 checksums of the sequences.
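To illustrate why sequence-derived IDs make merging straightforward, here is a minimal Python sketch. The nested-dict table layout, the `merge_tables` helper, and the sample names are assumptions made for illustration; this is not the BIOM format or any tool's API.

```python
from collections import defaultdict

def merge_tables(*tables):
    """Merge count tables keyed by ASV sequence (hypothetical helper).

    Each table maps ASV sequence -> {sample name -> count}.
    """
    merged = defaultdict(dict)
    for table in tables:
        for asv, counts in table.items():
            for sample, n in counts.items():
                # The same sequence in two studies lands in the same row,
                # with no lookup against an external OTU database.
                merged[asv][sample] = merged[asv].get(sample, 0) + n
    return dict(merged)

# Toy data: two studies that share one ASV.
study_a = {"GATTACA": {"A1": 10}, "GATTTCA": {"A1": 3}}
study_b = {"GATTACA": {"B1": 7}}

merged = merge_tables(study_a, study_b)
print(merged["GATTACA"])  # {'A1': 10, 'B1': 7}
```

The shared ASV ends up as a single row with counts from both studies, which is the property the ID scheme needs to preserve.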

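MD5 checksums are great because the MD5 hash function takes a string or an entire file and generates a sequence of 32 hexadecimal digits: the same input always gives the same digest, while different inputs give (essentially) unique digests. For example, if your ASV sequence is "GATTACA" (very short for this example), you can compute a checksum with the md5 command on macOS/BSD (md5sum on Linux). Note that echo appends a trailing newline, which is included in the hashed input; printf '%s' hashes the sequence alone: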

echo "GATTACA" | md5
e19467a4b7737544289ceed3a4761fc8

This seems like something we should be adopting for our reference sequences, to track ASV sequences across different studies.
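As a rough sketch of what adopting this could look like, the Python below derives an ID from a sequence with the standard-library `hashlib` module and writes it as a FASTA header. The helper names `asv_id` and `fasta_record`, and the choice to strip whitespace and uppercase the sequence before hashing, are assumptions for illustration rather than what DADA2 or any other tool does.

```python
import hashlib

def asv_id(sequence: str) -> str:
    """Return the MD5 hex digest of an ASV sequence (hypothetical helper)."""
    # Assumption: normalize first, so case or stray whitespace
    # does not produce a different ID for the same sequence.
    normalized = sequence.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

def fasta_record(sequence: str) -> str:
    """Format one FASTA record whose header is the MD5-based ID."""
    normalized = sequence.strip().upper()
    return f">{asv_id(normalized)}\n{normalized}\n"

print(fasta_record("GATTACA"), end="")
```

Because this hashes only the sequence characters, its digest will generally differ from one produced by piping `echo` output, which also hashes the trailing newline. Whatever normalization is chosen would need to be part of the agreed scheme, since IDs only match across studies if everyone hashes exactly the same bytes.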

Luke

pbuttigieg commented 5 years ago

Agreed - I believe SWARM does this too to name its OTUs