a-h-b / dadasnake

Amplicon sequencing workflow heavily using DADA2 and implemented in snakemake
GNU General Public License v3.0

Question on representative sequences #38

Closed: juismo closed this issue 11 months ago

juismo commented 11 months ago

Hi ahb, many thanks for including the post-clustering step. I have two questions related to the fasta-files (representative sequences) after post-processing:

  1. Is the filtered.seqs.fasta file based on the most abundant ASV within each cluster, or perhaps on centroid sequences? And do these sequences appear as Row.names in the final table, as we already know from ASV tables without clustering?
  2. The filtered.consensus.fasta file follows the vsearch definition (taking the majority symbol, nucleotide or gap, from each column of the alignment), right?

Thanks, ju(is)mo

a-h-b commented 11 months ago

Hehe :-)

  1. No: filtered.seqs.fasta is based on all ASVs. The clustered sequences are not represented in the ASV table; they are represented in the clusteredTab, as Row.names.
  2. Yes, filtered.consensus.fasta contains the 'OTUs'/clustered ASVs that remain after filtering. If you use vsearch, these are consensus sequences as defined by vsearch: "For each cluster, a multiple alignment is computed, and a consensus sequence is constructed by taking the majority symbol (nucleotide or gap) from each column of the alignment. Columns containing a majority of gaps are skipped, except for terminal gaps." If you use decipher, they are actually just a random sequence from the cluster (maybe the one that comes first alphabetically?) :-)

ahb

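
To make the quoted vsearch definition concrete, here is a minimal Python sketch of a column-wise majority consensus. It is an illustration only, not vsearch's or dadasnake's actual code. The function names `mask_terminal_gaps` and `majority_consensus` are made up for this example. Two points are assumptions: the sketch reads "except for terminal gaps" as "leading and trailing gaps of a sequence are not counted as gap votes", and it resolves a tie between a nucleotide and a gap in favour of the nucleotide.

```python
from collections import Counter


def mask_terminal_gaps(row: str) -> str:
    """Replace leading/trailing '-' with '.' so they are not counted as votes.

    Assumption: this is one reading of "except for terminal gaps".
    """
    core = row.strip("-")
    if not core:
        return "." * len(row)
    lead = len(row) - len(row.lstrip("-"))
    trail = len(row) - len(row.rstrip("-"))
    return "." * lead + core + "." * trail


def majority_consensus(aligned_rows):
    """Build a consensus from equal-length aligned sequences of one cluster."""
    if not aligned_rows:
        raise ValueError("need at least one aligned sequence")
    rows = [mask_terminal_gaps(r.upper()) for r in aligned_rows]
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*rows):
        # Count votes, ignoring masked terminal gaps.
        counts = Counter(sym for sym in column if sym != ".")
        gap_votes = counts.pop("-", 0)
        if not counts:
            continue  # only gaps (or nothing) in this column: skip it
        # Most common nucleotide; ties go to the alphabetically first symbol.
        best = max(sorted(counts), key=counts.get)
        # Assumption: on a nucleotide/gap tie, the nucleotide is kept.
        if counts[best] >= gap_votes:
            consensus.append(best)
    return "".join(consensus)


if __name__ == "__main__":
    cluster = [
        "ACGT-A",
        "ACGTTA",
        "ACCT-A",
        "--GT-A",  # leading gaps are terminal, so they do not vote
    ]
    print(majority_consensus(cluster))  # ACGTA
```

In the toy cluster, the fifth column holds three gaps and one T, so it is skipped, while the first two columns are kept because the leading gaps of the last sequence do not count against them.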
juismo commented 11 months ago

Okay, thank you!