Closed roblanf closed 4 years ago
Hi Rob,
Thanks for the suggestions, I already implemented --ref-seq
option for other commands, so it was not so difficult.
However, so far all insertions compared to the reference sequence are also masked. I don't know if it is desirable.
You can test it on the v0.3.3c pre release.
Frederic
Thanks so much for implementing this. I tested it an it works exactly as I wanted. I'm a little confused only because that doesn't seem to quite match what you described (probably just me misunderstanding, but I wanted to clarify in case I'm missing something important).
Example, using v0.3.3c:
test.fa:
>ref1
AAC---GGGTTT
>seq1
AACTTTGGA---
./goalign_amd64_darwin mask -i test.fa -l 1 -s 5 --ref-seq 'ref1'
output:
>ref1
AAC---GGNTTT
>seq1
AACTTTGGN---
This is exactly what I want - the position is masked in the sequence, and nothing else has changed.
My only confusion is that you said "all insertions compared o the reference are also masked", which meant I was expecting bases 4-6 to be masked in seq1. Note that this is not I what I wanted - what I want is exactly what it's doing right now!! Just thought I'd highlight my confusion.
Thanks again!
I was not very clear indeed.
If I take your example:
>ref1
AAC---GGGTTT
>seq1
AACTTTGGA---
If you mask positions [2-3] in reference coordinates (goalign mask -s 2 -l 2 --ref-seq ref1
) you will end up with:
>ref1
AANNNNNGGTTT
>seq1
AANNNNNGA---
It also masks insertions/gaps that are in the interval.
AH, OK, got it. That makes sense. One way to describe this is something like:
"If the range of masked sites incorporates gaps in the reference sequence, these gaps will also be masked in the output alignment"
Thanks again - this is really awesome.
Thanks, I modified the documentation and the help with your suggestion. I close the issue, feel free to reopen it if needed.
Hi,
A suggested enhancement to
mask
, please feel free to ignore!One useful thing for masking sequences is to mask them with respect to positions in an un-gapped reference. E.g. imagine I had:
I may know that I want to mask based on position 6 of the reference sequence before it was aligned. That position ends up being position 9 in this alignment. This can be a headache in large projects because I can't know in advance if the non-reference sequence will have insertions, so it's a pain to figure out by hand where the position I want to mask has ended up in the final alignment.
In this case my desired output would be:
and I might think of a command line something like
goalign mask -ref 'ref1' -s 6 -l 1
The key difference being that if one specifies some flag like
-ref [ref_seq_name]
then the -s and -l flags would correspond to ungapped positions in the named reference sequence, not positions in the alignment.Rob