andersen-lab / bjorn

GNU General Public License v3.0
20 stars 4 forks source link

Store sequences with Ns in positions with mutations #20

Open gkarthik opened 3 years ago

gkarthik commented 3 years ago

Currently, when mutation prevalence is calculated, we don't account for sequences with Ns in those positions. While this is not a problem for older lineages like B.1.1.7, it could be an issue when showing characteristic mutations for new lineages with few sequences, such as AY.1 and AY.2.

In order to account for Ns, we would have to store sequences with Ns in sites that have a mutation. This could increase the size of the database, but it's unclear how much of an increase that would be.

This will allow us to compute mutation prevalence in the API by excluding the sequences with Ns.
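A minimal sketch of what that API-side computation could look like, assuming a hypothetical per-sequence record that stores both called mutations and N sites (the field names here are illustrative, not bjorn's actual schema):

```python
# Illustrative only: "mutations" and "n_sites" are hypothetical fields,
# not bjorn's actual database schema.
def mutation_prevalence(sequences, name, pos):
    """Prevalence of mutation `name`, counting only sequences that
    have real coverage (no N) at nucleotide position `pos`."""
    informative = [s for s in sequences if pos not in s["n_sites"]]
    if not informative:
        return None  # no sequence covers this site
    carriers = sum(1 for s in informative if name in s["mutations"])
    return carriers / len(informative)

seqs = [
    {"mutations": {"S:E484K"}, "n_sites": set()},
    {"mutations": set(), "n_sites": {23012}},  # N at the site: excluded
    {"mutations": set(), "n_sites": set()},
]
prev = mutation_prevalence(seqs, "S:E484K", 23012)  # 1 of 2 informative -> 0.5
```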

willfischer commented 3 years ago

The idea is to use the "non-N" counts at each position to determine frequencies, right? Depending on how the data are stored, it might be possible to store for each mutation position the count of non-N bases (by lineage), and use that.

gkarthik commented 3 years ago

Hey @willfischer , since we allow users to dynamically subset sequences by mutations, location, etc., it is not trivial to store "non-N" counts for all possible combinations, but we should be able to store it by sequence. More testing to come.

willfischer commented 3 years ago

Hi Karthik,

I am probably missing something, but you shouldn't have to store non-N counts by position.

Here is my imagined view of your data, and a naive approach to the problem (based on my possibly incorrect assumptions). I assume you folks store translated amino acids for each position by sequence, and then compute frequencies on the fly using the called mutations divided by the total number of sequences in the subset. If this is correct, the simple approach is to add the amino acid ‘X’ where there isn’t a translatable codon.

Then when computing the frequencies, instead of this (using made-up variable names for clarity):

Current approach: gives the wrong answer because "seqs_in_subset" includes the count of all the sequences

Freq_of_AA_at_position = count_of_AA_at_position / seqs_in_subset

use this:

New approach: corrects for untranslatable codons

Freq_of_AA_at_position = count_of_AA_at_position / (seqs_in_subset - count_of_X_at_position)

Of course, the reporting code would need to exclude “Xs”.
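As a concrete sketch, here are the two formulas above with made-up counts plugged in (variable names follow the made-up ones from the formulas):

```python
# Illustrative counts only.
seqs_in_subset = 100          # total sequences in the user's subset
count_of_AA_at_position = 40  # sequences calling the mutant amino acid
count_of_X_at_position = 20   # sequences with an untranslatable codon here

# Current approach: denominator includes sequences with no call at this site.
naive_freq = count_of_AA_at_position / seqs_in_subset  # 0.40

# Corrected: exclude the 'X' (untranslatable) sequences from the denominator.
corrected_freq = count_of_AA_at_position / (
    seqs_in_subset - count_of_X_at_position
)  # 0.50
```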

If I’ve misunderstood how you’re doing things, I apologize. Alternatively, if I’m belaboring the obvious, forgive me — but this seems (from my distant vantage point) like a relatively simple way to approach the problem.

Best regards,

— Will


Will Fischer


Group T-6, Theoretical Biology Los Alamos National Laboratory Los Alamos, NM 87505 USA


willfischer commented 3 years ago

Thanks, Will, for the note. I'm also puzzled.

I was glad to find the footnote on your web page about the S-gene mutations, thank you.

But I'm still worried about the issue. I know people have been using your Spike mutation lists to design Spike reagents, but I don't know whether it is many people or just a few.

The issue that concerns me is that we have been hearing conflicting reports of neutralization sensitivity of Delta viruses, and they are usually just described as “Delta”, with no mutation list, by folks in the immunology community.

The virologists and immunologists have been trying to figure out why they get such different results for Delta, considering pseudotype vectors, cell types, and in vivo vs. in vitro. But I think at this point few appreciate that they might be using different Delta sequences, and that this is in fact another variable.

I have no idea if the G is important, but I also think we can't assume it isn't.

Thank you for working hard on this, and good luck.

My intention here is only for us to capture a moving target as well as we can; it's a struggle for all of us. Let's keep communicating and keep trying to help each other sort it out.

Best Regards,

bette


gkarthik commented 3 years ago

Hello Will and Bette,

Will, the approach you detailed of storing X is indeed what I originally proposed to do in the issue (albeit at the nucleotide level which will result in X after translation). However, with over 2 million sequences, we need to consider how large the data might become. This requires testing on our end.

Bette, the first I've heard of people using the lists for reagent design is from you folks. It's unclear to me how common that is. We have added disclaimers here (https://outbreak.info/situation-reports/methods#characteristic) and at the bottom of each section stating that for reagent design we recommend, at the very least, downloading the consensus sequences from GISAID.

Additionally, in case you missed the email thread, the issue with S:E156G has been resolved (please see https://github.com/andersen-lab/bjorn/pull/24). We now capture mutations caused by out-of-frame deletions for ALL lineages including B.1.617.2 and C.37.

As you said, this is a moving target and I suspect we will have to continuously keep up to capture the most relevant mutations in circulating lineages.

Best, Karthik

willfischer commented 3 years ago

I see — you’re storing nucleotides. Sure, allowing IUPAC codes expands the data from 4 states to 15, and the number of codons from 64 to 3375(!).

You can of course get fancy, record all the ambiguities, and build up a full ambiguity-conscious translation table … maybe it’s worth it for 3rd-position Ns for the fourfold codons, and 3rd position R and Y for the two-fold codons, and ATH for isoleucine. But that way madness lies.

But in this context the most reasonable approach is probably just to transform all the ambiguous IUPAC bases (K Y W S R M B D H V N) to ’N’, store them that way, and be done with it. You can add TCN, CTN, CCN, CGN, ACN, GTN, GCN, GGN to your translation table to catch a few more AA, and translate any other codon that has an ’N’ in any position to ‘X’.
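A rough sketch of that collapse-and-translate scheme (illustrative code, not bjorn's implementation; the codon table here is deliberately tiny):

```python
# Illustrative only. A real implementation would use a full 64-entry codon table.
AMBIGUOUS = set("KYWSRMBDHV")  # every IUPAC ambiguity code except N itself

def collapse(seq):
    """Map all ambiguous IUPAC bases to N, so only A/C/G/T/N are stored."""
    return "".join("N" if b in AMBIGUOUS else b for b in seq.upper())

# N-tolerant entries for the fourfold-degenerate codons listed above.
FOURFOLD_N = {"TCN": "S", "CTN": "L", "CCN": "P", "CGN": "R",
              "ACN": "T", "GTN": "V", "GCN": "A", "GGN": "G"}

def translate_codon(codon, table):
    """Translate one codon; any codon with an unresolvable N becomes X."""
    codon = collapse(codon)
    if codon in FOURFOLD_N:
        return FOURFOLD_N[codon]
    if "N" in codon:
        return "X"
    return table[codon]

table = {"ATG": "M", "TGG": "W"}  # stand-in for a full codon table
# translate_codon("GGR", table) -> "GGR" collapses to "GGN" -> "G"
# translate_codon("ANG", table) -> contains N, not fourfold   -> "X"
```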

It's crazy how, after some years of having storage and RAM to burn, we have to think carefully about this stuff again.

All the best,

— Will


AlaaALatif commented 3 years ago

Hi Will and Karthik,

Will -- thank you for sharing your proposed solution in a well-detailed manner. Although the solution would certainly solve the issue, I believe it may not be feasible given the computational context.

A core part of the algorithm for identifying mutations is that the steps for identification, translation, and all subsequent downstream processing happen only at points (nucleotide positions) where there are mismatches between a given sequence and the reference. This is what lets us process sequencing data at a global scale and in near real time.

One realization of the proposed solution would be to process all ambiguous base calls. Assuming an average coverage of 80% for any given sequence, this would mean roughly ~2000 additional positions to process per sequence, whereas currently we only need to process an average of about 50 positions per sequence (*). This would certainly lead to a considerable increase in time and memory consumption, but testing is necessary to better understand how feasible it would be.

An alternative realization would be to "only check positions in the sequence that matter". This would entail grouping each sequence based on a subset of mutations it shares with other sequences. But I believe this would compromise the current ability to perform effective parallel processing (by treating each sequence independently).

This is an important feature to implement, and it would lead to an improved understanding of the prevalence of mutations. Discussions exactly like these can help us arrive at an effective solution.

I hope I understood this discussion correctly and that my response is clear.

Cheers, Al

(*) I used assumptions of common averages for coverage and mutation rate for these estimates.

gkarthik commented 3 years ago

Hey Al, thanks for weighing in. I agree that the first approach can be parallelized more easily; that is what I had in mind too. We do need to run tests on this to see what the increase in run time and database size would be.

mindoftea commented 2 years ago

One potentially viable solution to this problem might be to assume that sequences with ambiguous mutation sites contain those mutations with the same probabilities as their previously-scanned lineage-mates. As it processes sequences in a given lineage, Bjorn can accumulate a table of mutations observed to occur within that lineage with probability p > epsilon (with epsilon on the order of 1%). Then when a new sequence is scanned, Bjorn can compare any patches of Ns it might contain against its lineage's mutation table; when the occurrence of a mutation can't be ruled out (that is, the site is covered by Ns), we can mark the sequence as 'maybe' having that mutation, with probability p.

Downstream, we would use two queries to recover the unambiguous mutation prevalence: one (X) excluding sequences where the mutation is marked with 'maybe', and one (Y) including only those sequences. Taking X / (T - Y), where T is the total number of sequences in the lineage, would then yield an estimate of the prevalence of the mutation as if no sequences containing Ns at that site had been scanned.
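A minimal sketch of that estimator, with X, Y, and T as defined above (the counts are illustrative):

```python
def estimated_prevalence(X, Y, T):
    """Prevalence among sequences with an unambiguous call at the site:
    X definite positives over the T - Y sequences that are not 'maybe'."""
    informative = T - Y
    if informative == 0:
        return None  # every sequence has Ns over the site
    return X / informative

# e.g. 40 definite positives, 20 'maybe' sequences, 100 sequences total
est = estimated_prevalence(40, 20, 100)  # 40 / 80 -> 0.5
```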

This problem could also be generalized to interpolating a binomial distribution over counts of mutation-positive sequences within a lineage -- given X, Y, T, and p from above, we can estimate the probability mass function P of the true positive count K as P(K) = C(Y, K - X) p^(K-X) (1-p)^(X+Y-K). This would allow us to give mutation prevalence estimates as confidence intervals on outbreak.