One of the advantages of using exact-sequence ASVs is that the sequence itself is the OTU identifier. Multiple BIOM tables can be merged because the ASV IDs do not depend on any external OTU database. In fact, in Deblur, the FASTA headers for the representative sequences were originally the sequences themselves, but headers that long become cumbersome. DADA2 has a different solution: the FASTA headers for the representative sequences are MD5 checksums of the sequences.
MD5 checksums are great for this because the MD5 hash function takes a string or an entire file and generates a sequence of 32 hexadecimal digits, reproducibly for the same input but (essentially) unique for every distinct input. For example, if your ASV sequence is "GATTACA" (very short for this example), its MD5 checksum is a fixed 32-character string that anyone can recompute from the sequence alone.
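To make this concrete, here is a minimal sketch of computing such an ID with Python's standard `hashlib` module. The function name `asv_id` is just an illustrative choice of mine, and I'm assuming the sequence is hashed exactly as written (plain ASCII, no trailing newline); whether a given pipeline normalizes case or whitespace first is something to check before comparing IDs across tools.

```python
import hashlib


def asv_id(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence string (hypothetical helper)."""
    # Assumption: hash the raw sequence text as ASCII bytes, with no
    # normalization. The hash is case- and whitespace-sensitive.
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


checksum = asv_id("GATTACA")
print(checksum)       # 32 hexadecimal characters
print(len(checksum))  # 32
```

Running the same function on the same sequence always gives the same ID, while changing even one base ("GATTACC") gives a completely different one, which is what lets tables from separate studies be merged on these IDs.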
This seems like something we should be adopting for our reference sequences, to track ASV sequences across different studies.
Luke