Option to filter sites/columns in alignment based on percentage of identity

rguerillot commented 4 years ago

Thanks for developing this great alignment processing tool. I would be very interested if you could add an option that would allow to filter sites/columns in an alignment based on conservation across all sequences in the alignment (percentage of identity, eg ). This would be very helpful for constructing "relaxed" core variants from multiple genomes alignment.

Romain

fredericlemoine commented 4 years ago

Hi Romain,

Thanks for using goalign and for your suggestion. How would you compute this conservation score? Would it be, for a given column, the number of sequences that are different from a reference ? Or the number of different alleles or the « entropy » of the column for example?

Frederic

rguerillot commented 4 years ago

Hi Frederic,

For the conservation score I was thinking of:

most abundant character (base or amino acid or N or X or -) in the column / total number of characters in the column.

Ideally the option to filter columns based on this score would allow to set "conservation score" threshold for inclusion/exclusion and also allows to specify if this score is calculated from most abundant character in the column or a specific character (- or A or N...).

The application I have in mind is the custom filtering of bacterial whole genomes alignment to get core variant alignment which are commonly used to build phylogenic trees. In this particular case people usually remove all identical columns to only retain variant sites (eg. remove column with 100% conservation score) and filter all columns that contain one or more gap (-). The problem with this is this approach is the number of column remaining in the final core variants alignment is getting smaller an smaller when you add more genomes for various reasons (eg 1 genome has a deletion (-) at a site) and therefore the resolution of your phylogeny become quickly very bad when you have 1000s of genomes.

With the option I suggest one could fully control the way they filter columns to obtain the "core" variants alignment (eg remove invariant column with 100% identity score, retain position <1% of gap or deletion -, remove column with >2% ambiguous position (N)...).

There is currently no alignment processing tool that allow this level of filtering flexibility.

I hope my explanation make sense.

Romain

fredericlemoine commented 4 years ago

Hi Romain, I think I understand, thanks for the suggestion. I tried with the option --char to goalign clean sites.

WIth --char MAJ --cutoff 1, it will remove sites with 100% of the most abundant character at these positions (100% invariant sites).
With --char MAJ --cutoff 0.9, it will remove sites having >= 90% of the most abundant character at these positions.
With --char N --cutoff 0.1, it will remove sites having >= 10% N

You can try this test version under linux: goalign_amd64_linux.zip

Do not hesitate to tell me what you think about that option,

Frederic

jean-baka commented 4 years ago

Hi Fred, hi Romain,

But usually one wants to retain sites that are rather well conserved, not filter them out. There is a whole body of literature about alignment trimmers, some of the most famous being GBlocks, TrimAl and BMGE. Also, most of them work with sliding windows to smoothen the conservation signal, so that we don't keep one individual site inside a larger non-conserved/noisy region, etc.

So this can reach quite far. I would suggest that the most simple filtering scheme is to filter out sites having more than a certain percentage of gaps or N (if nucleotide alignment) or X (if a.a. alignment).

I'm not sure what you want to do with the identity of the most common character in a site. Doesn't matter too much, and making decisions based on the ratio of sequences having that character out of the total number of sequences doesn't work at least for amino acid sequences (e.g. a site comprising 1/3 I, 1/3 L and 1/3 V) is very well conserved because isoleucine, leucine and valine have very similar physico-chemical properties.

Cheers, Jean-Baka

On Sat, 2 May 2020, 23:30 fredericlemoine, notifications@github.com wrote:

Hi Romain, I think I understand, thanks for the suggestion. I tried with the option --char to goalign clean sites.

WIth --char MAJ --cutoff 1, it will remove sites with 100% of the most abundant character at these positions (100% invariant sites).

With --char MAJ --cutoff 0.9, it will remove sites having >= 90% of the most abundant character at these positions.

With --char N --cutoff 0.1, it will remove sites having >= 10% N

You can try this test version under linux: goalign_amd64_linux.zip https://github.com/evolbioinfo/goalign/files/4568782/goalign_amd64_linux.zip

Do not hesitate to tell me what you think about that option,

Frederic

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/evolbioinfo/goalign/issues/5#issuecomment-623009333, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACA6M7S6V76KWRMHBSYE723RPR7GBANCNFSM4MWWENAA .

tseemann commented 4 years ago

I think @rguerillot and @jean-baka are both right, and there are different things one might to do, that are problem and data dependent.

For the the case of COVID-19 we have 1000s of consensus genomes, but most are patchy with NNN runs (low depth) and some gap - (true deletion or maybe 0 depth). Because this data is not high quality the signal to noise ratio is low and false SNPs distort phylogeny badly.

In the COVID case with 1000s of samples, removing a column (site) that is nearly invariant but not quite is a good idea, as the singleton/outlier is more likely to be error than noise, and if kept will produce artificially long branch lengths. For this case, as @jean-baka said, we want the opposite of --char MAJ --cutoff 0.9,

1

Do we need an --inverse / -v option to invert the filtering?

2

I like the idea of pseudo-character classed MAJ etc. It's elegant! Maybe MAJ1 MAJ2 .. MAJn for most to lowest frequent? I would like to be able to say "remove sites where MAJ2 (second most frequent allele) occurs < 3 times" (see 3 too)

3

And then character class MIN ? And MIN1 MIN2 ... MINn for least to most frequent? I think there are use cases for knowing the lowest freq allele too. (MAJn is the same thing but we don't know n)

4

Support fractions and absolute values Sometimes i want to discard a site if it has ONE (1) outlier. I don't want to have to calculate the number of rows first then convert to a proportion. ie. if --cutoff is 0 to 1.0 (yes 1.0) then treat as proportion if --cutoff is 1 (not 1.0) to INT_MAX then treat as absolute number

5

Do we need an option for case insensitive -i ?

fredericlemoine commented 4 years ago

Thanks @jean-baka and @tseemann for your interesting comments.

I indeed thought about alignment trimmers, I should have mentioned them in my first comment. And maybe even just for removing sites just based on number of gaps, proper alignment trimmers could be a better option.

In the case of COVID, there are cases where we want to remove mutations that are unique in their column, because they are most likely to be sequencing errors than real mutations (especially if there are several such columns). So I was not that surprised about that need. But indeed, it could necessitate more sophisticated methods.

For your suggestions @tseemann:

An inverse option would be interesting indeed. But I am not sure what you mean by the opposite of --char MAJ --cutoff 0.9. This option means that you remove sites where the most abundant character constitutes 90% or more of the site. Would the opposite be that you keep only such sites?
Very interesting ideas about classes of characters. I will think about it.
This is something I was planning to do, exactly for the case you mention of 1 occurrence.
The case insensitive option would be interesting, I started thinking about it while writing the documentation.

tseemann commented 4 years ago

Other character class examples would be:

DNA [AGTC]
DNAN [AGTCN]
IUPAC
any valid IUPAC code eg. R ie. count A + T

Maybe negative classes?

not DNA = --char ^DNA or similar

I've used trimal etc before but their focus seems to be on removing sequences, not columns. And trimal is very SLOW for the column operations.

You referred to entropy - i notice EASEL has operations that use that. it's also portentiually useful but for fine grained control (and ease of reviewer understanding) the clear cut (3 or less) or (10%) is easier to follow.

roblanf commented 4 years ago

Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a column) might be useful.

The reason is as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site seems like we'll be removing more and more data as time goes on.

fredericlemoine commented 4 years ago

Just chiming in for the COVID case here. I think an option to mask a singleton site (rather than mask/remove a column) might be useful.

The reason is as our COVID alignments grow, there's a monotonic increase in the proportion of sites with singletons. So just removing the entire site seems like we'll be removing more and more data as time goes on.

I added the option --uniqueto goalign mask. It masks characters that are unique in their column (except gaps). If you try it (linux executable here) do not hesitate to tell me if it does not correspond to what yo were suggesting.

tseemann commented 4 years ago

@fredericlemoine i can only install tagged releases. I'll wait until you formalise it.

tseemann commented 4 years ago

@roblanf i would argue singleton columns will be less, or equally, likely as we sample more widely? the virus can only change so much, and we are sequencing all PCR+ve samples we get?

roblanf commented 4 years ago

Heh, good point. I guess it will be something like logistic growth in that case.

In a recent alignment of ~11K sequences, 24.5K of the 29.5K sites (~80%) were constant. So there's still a pretty big target of constant sites. If we really wanted to, we could downsample by date and plot the curve...

tseemann commented 1 year ago

@fredericlemoine Here I am 3 years later, and working on bacteria again, and encountering similar problems that I want to solve with goalign clean sites but not sure how to.

A common step in bacterial phylo is to only use the "core" sties. That is, sites (columns) which ONLY have A,C,G,T. This is overly strict, so prefer to do "fuzzy core" which allows some % of non-A,C,G,T letters.

For example, a 95% fuzzy core would only keep sites with >= 95% A,C,G,T, and allow for 5% to be GAP, N, X, or IUPAC say.

I think what I need is for goalign clean sites --inverse --char ACGT --cutoff 0.05

But there is no --inverse and no way for --char to take a SET or CLASS of characters.

Is there a better way to do this?

fredericlemoine commented 1 year ago

@tseemann , you may try the code after the last commit. It should work with --char ACGT (or any set of characters, e.g. --char AG-) and with --reverse.

fredericlemoine commented 1 year ago

@tseemann I just pushed a new docker image with the changes in case you don't want to build goalign :

docker pull evolbioinfo/goalign:dev

evolbioinfo / goalign

Option to filter sites/columns in alignment based on percentage of identity #5

1

2

3

4

5