Adds reverse complementing to orient-seqs

jwdebelius commented 2 years ago

As discussed in #119.

Probably worth mentioning that I see some utility in keeping reverse complemented for aligned sequences, which is whats currently implemented, but I haven't explored the implications for orienting to a reference where it probably does break vserach, and might be a motivation for a separate method.

nbokulich commented 2 years ago

Hi @jwdebelius thanks for this! 🎉

Some initial ideas re: the point you raise about aligned seqs

I agree with you: reverse orientation of aligned sequences is a prominent use case, and we should support. But we do not want to introduce a bug re: VSEARCH compatability.

This might be a case for using TypeMap — i.e., so that FeatureData[AlignedSequence] is only accepted as input when reference_sequences=None. If that works, we will need to make the usage clear in the help documentation.

We could also introduce a separate orientation parameter (with options same, complement, reverse, or reverse complement), which might be a bit easier to use with TypeMap, though only if we think these options would be useful.

Another idea that might be cleaner and more functional. Accept aligned seqs as input with an unaligned ref.

degap a copy of the aligned seqs on input
run the degapped seqs through vsearch. Vsearch reports the orientation of each seq (+ / -)
orient the original aligned seqs based on the sign, and output

No matter which route we take, could you please add unit test(s) for orienting aligned seqs with the various options? I think that the current setup actually will not work with aligned sequences, since the view type is DNAFASTAFormat... I believe a DNAIterator is needed instead.

@mikerobeson any thoughts?

mikerobeson commented 2 years ago

For whatever reason I just noticed this. I'll look into it now.

I do not currently have much to @nbokulich's comments. Although @jwdebelius and I were discussing the old filter_alignment.py script from QIIME 1. In particular, the ability to remove poorly aligned sequences using the --remove_outliers and associated flags. Perhaps we can add a separate action in either the q2-alignment plugin, or rescript simply called detect-alignment-outliers?

This would set us up for a couple of downstream use cases: 1) detect the poorly aligned sequences (write the sequences and/or their ids to file), then investigate / filter downstream, using one of the QIIME 2 standard filtering commands 2) use output from 1, to specify which sequences in the FASTA file should be reverse-complemented, leave the others as is.

nbokulich commented 2 years ago

that sounds very useful! also a very different use case, i.e., for filtering alignments on quality (which sounds like a method for q2-alignment). Sounds like the q1 method also masked the alignments? So a very different beast (and one that is partially done now in q2-alignment?). Either way, filtering alignments sounds like very different functionality from read orientation.

So I propose that

we stick with the original plan to support [aligned] seq orientation in rescript
you open a separate issue for alignment filtering in q2-alignment @mikerobeson

@jwdebelius how would you like to proceed? See my note above about your current PR most likely not working for aligned seqs... I propose two options:

you add fully functional support for alignment orientation, incl. unit test(s) (see my ideas above)
you drop alignment support from the current PR. We can address this in a future PR.

I am happy with either path 😁

jwdebelius commented 2 years ago

Hi @nbokulich,

Sorry, things have been umm... busy the last 3 months. I'm a fan of the second option. I'll clean up the PR, and hopefully you can check it and it can be merged.

nbokulich commented 2 years ago

hi @jwdebelius, thanks for updating this PR! Please let me know when this is ready to review

jwdebelius commented 2 years ago

Thanks @nbokulich, I think it should be good to go!

jwdebelius commented 2 years ago

I hope this addresses your concerns @nbokulich! There are a couple of things I wasn't sure how to address directly

jwdebelius commented 2 years ago

I think this should address your documentation issues, @nbokulich.

bokulich-lab / RESCRIPt

Adds reverse complementing to orient-seqs #121