evolbioinfo / goalign

Goalign is a set of command line tools and an API to manipulate multiple sequence alignments. It is implemented in Go language.
GNU General Public License v2.0
71 stars 8 forks source link

suggested enhacement: mask w.r.t. to reference #6

Closed roblanf closed 4 years ago

roblanf commented 4 years ago

Hi,

A suggested enhancement to mask, please feel free to ignore!

One useful thing for masking sequences is to mask them with respect to positions in an un-gapped reference. E.g. imagine I had:

ref1    AAC---GGG
seq1    AACTTTGGA

I may know that I want to mask based on position 6 of the reference sequence before it was aligned. That position ends up being position 9 in this alignment. This can be a headache in large projects because I can't know in advance if the non-reference sequence will have insertions, so it's a pain to figure out by hand where the position I want to mask has ended up in the final alignment.

In this case my desired output would be:

ref1    AAC---GGN
seq1    AACTTTGGN

and I might think of a command line something like

goalign mask -ref 'ref1' -s 6 -l 1

The key difference being that if one specifies some flag like -ref [ref_seq_name] then the -s and -l flags would correspond to ungapped positions in the named reference sequence, not positions in the alignment.

Rob

fredericlemoine commented 4 years ago

Hi Rob,

Thanks for the suggestions, I already implemented --ref-seq option for other commands, so it was not so difficult. However, so far all insertions compared to the reference sequence are also masked. I don't know if it is desirable.

You can test it on the v0.3.3c pre release.

Frederic

roblanf commented 4 years ago

Thanks so much for implementing this. I tested it an it works exactly as I wanted. I'm a little confused only because that doesn't seem to quite match what you described (probably just me misunderstanding, but I wanted to clarify in case I'm missing something important).

Example, using v0.3.3c:

test.fa:

>ref1
AAC---GGGTTT
>seq1
AACTTTGGA---

./goalign_amd64_darwin mask -i test.fa -l 1 -s 5 --ref-seq 'ref1'

output:

>ref1
AAC---GGNTTT
>seq1
AACTTTGGN---

This is exactly what I want - the position is masked in the sequence, and nothing else has changed.

My only confusion is that you said "all insertions compared o the reference are also masked", which meant I was expecting bases 4-6 to be masked in seq1. Note that this is not I what I wanted - what I want is exactly what it's doing right now!! Just thought I'd highlight my confusion.

Thanks again!

fredericlemoine commented 4 years ago

I was not very clear indeed.

If I take your example:

>ref1
AAC---GGGTTT
>seq1
AACTTTGGA---

If you mask positions [2-3] in reference coordinates (goalign mask -s 2 -l 2 --ref-seq ref1) you will end up with:

>ref1
AANNNNNGGTTT
>seq1
AANNNNNGA---

It also masks insertions/gaps that are in the interval.

roblanf commented 4 years ago

AH, OK, got it. That makes sense. One way to describe this is something like:

"If the range of masked sites incorporates gaps in the reference sequence, these gaps will also be masked in the output alignment"

Thanks again - this is really awesome.

fredericlemoine commented 4 years ago

Thanks, I modified the documentation and the help with your suggestion. I close the issue, feel free to reopen it if needed.