bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License
85 stars 26 forks source link

[ENH] `trim-alignment`: how to deal with gappy primer alignments? #129

Open nbokulich opened 2 years ago

nbokulich commented 2 years ago

To improve appropriate use, some safety catches should be put in place to deal with poor primer alignments that lead to wonky trimming. This is specific to cases when primers are input, not positions.

Raised by @jwdebelius on the forum.

So far I don't think that there is a bug involved, just that mafft is not an in silico PCR tool (by design) and hence bad inputs will lead to bad alignments (and possibly trimming at the wrong position).

To repeat and expand on the discussion in the forum, I propose that we add one or more of the following improvements (not necessarily all):

  1. we should document the assumptions of this method (that the primers actually align well to the reference)
  2. we should document user best practices. This could be as much as a tutorial on how to QC the input database prior to using this method. It could be as little as recommending that users should manually inspect the alignments and/or outputs to check that length expectations are met.
  3. trim-alignment could raise a warning if the primers align with a large number of internal gaps. The warning could recommend manual inspection and re-running to trim on positions instead of alignments.
  4. trim-alignment could output a visualizer to allow direct inspection, i.e., to manually confirm and diagnose issues like this (whether or not a warning is raised).

Maximum effort: replace mafft with something else (blast?) to find the primer alignment position and to penalize internal gaps. This might find better trimming positions and be more in silico PCR-esque, but it would not solve the issue (as the case shown by @jwdebelius is basically that some of the ref seqs simply do not align well to the primer and should in fact be removed, not that they are ragged seqs that start downstream of the primer binding position).

Ultimately in cases like this I think the users should just fall back on manual inspection of the alignment and trimming at an explicit position, instead of the stop position of the primer in that alignment (which in this case is further than I would select manually, as mafft inserts a very large internal gap in the forward primer). OR for users who expect poor coverage to use q2-feature-classifier to extract reads.

mikerobeson commented 2 years ago

I just wanted to point out, that having many gaps inserted within the aligned primers is typical. Check out step 7 of the prototype code for make_SILVA_db. Note, the massive amount of positions that the primers span against the SILVA alignment. Yes, these are aligning where they should. So, how to gauge what too many gaps means, might take some work.

I really would love to work on a simple alignment viewer (qzv) for manual inspection. I think I hacked one up a while ago, but was very inefficient using bokeh. I think you had recommended that we pull from one of the other visualizers to simply have something to "look at". I forget which one that was... If simple enough we could try and work on that? I had the idea of having a visualizer where we can highlight consensus positions (or have a heat map, etc..) that would enable to user to find which alignment positions make the most sense to trim. Or better yet, enable the user to download the columns of the alignment that they'd like to keep (by user selectable options), export that as a text file, then use that as an explicit mask to filter the alignment.

As an aside, I'd like to bring over the much of the functionality of the old filter_alignment.py into q2-alignment.