bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License
89 stars 26 forks source link

IMP: add alignment trimming action #85

Closed misialq closed 3 years ago

misialq commented 3 years ago

Closes #33.

This PR adds a new trim-alignment action allowing to trim an existing alignment based on either: aligning a set of primers or explicitly specifying positions within alignment. In the first case, a new alignment is first created using mafft_add action from q2-alignment, primer positions are identified and the original alignment is trimmed using those. In the second case, user can specify a set of positions and the alignment will be trimmed accordingly.

mikerobeson commented 3 years ago

Nice work @misialq! I think we should include a subsample option, to subsample ~1000 sequences from a reference alignment. The reason being:

Many will likely use this to trim sections out of a SILVA alignment, witch often contains ~500k to a few million sequences with 50k alignment columns. As currently implemented, this trimming operation will not run on many user systems due to the large memory requirements. See, the lengthy "step 7" of my prototype approach here, where I simply used smaller set of the alignment to determine the alignment positions of the primers.

Also, it might be worth considering to add the --addfragments option of mafft to the mafft plugin. This flag is optimized to add short reads (e.g. primers) to a large alignment. The default --add may be okay, but I can foresee where this approach may not work as well for our use cases. Though we can worry about adding this option later.

nbokulich commented 3 years ago

@mikerobeson

I think we should include a subsample option

do you think that should be an option in this action or a separate action, maybe in a separate plugin? I do not have a fully formed opinion on this, just putting the idea out there.

mikerobeson commented 3 years ago

do you think that should be an option in this action or a separate action, maybe in a separate plugin? I do not have a fully formed opinion on this, just putting the idea out there.

That is a good idea! I agree that this would best as a separate action and/or within a separate plugin, and highlight its use within a tutorial. Especially for SILVA. I basically have to subsample SILVA in order to use this bit of code on my laptop.

misialq commented 3 years ago

Thanks for your comments @mikerobeson and @nbokulich!

nbokulich commented 3 years ago

subsampling: VSEARCH has a subsampling method that we can wrap for this... presumably it also works for alignned sequences? It surprises me that a fasta subsampler is not extant in some other QIIME 2 plugin, so if possible it would make sense to accept either aligned or unaligned sequences as input, and use TypeMap to output whatever type was input. What do you both think?

Hence, I decided to do it this way, add support for addfragment in the other plugin and then change here accordingly. What do you think?

Makes sense to me — what do you think @mikerobeson ?

misialq commented 3 years ago

I agree that it should support both formats. I just gave it a quick try: seems like vsearch does not work with alignments... Will come up with some other solution then.

mikerobeson commented 3 years ago

RE addfragments: I am totally fine keeping with mafft --add for the short term and implementing --addfragments to the mafft plugin later. I just wanted to have it on our radar.

RE subsample: I am also surprised sub-sampling is not available as it was in QIIME 1 through subsample_fasta.py. The closest I can find is qiime genome-sampler sample-random from genome-sampler. Though this only subsamples IDs, which then you can use to filter other items. Perhaps we can borrow from either of these implementations?

nbokulich commented 3 years ago

RE subsample: @thermokarst actually described this awesome workaround on the forum.

so that functionality does exist in QIIME 2, it's just not very obvious. As discussed there, it would be great to add that method somewhere (@misialq a chance to contribute either in RESCRIPt or another plugin if you are interested, maybe q2-feature-table would make sense)... but for the time being we could fall back on that workaround (i.e., I do not think this is a blocker for this PR)

mikerobeson commented 3 years ago

RE subsample: @thermokarst actually described this awesome workaround on the forum.

This is indeed awesome. However, I just tried subsampling the SILVA alignment using this approach and it died, too much memory.

I think we should make use of this super elegant approach, which simply iterates through the file and writes as it goes. Should be quick to implement. I can add this as a separate action through another PR, unless @misialq wants to do it? :-)

misialq commented 3 years ago

Thanks for testing @mikerobeson! Sure can do, I was going to look at the subsampling after this PR is done. Speaking of which, do you still want to have a look at it or how are we feeling about moving forward?