Ambiguous_content - Githubissues

cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.

GNU General Public License v3.0

426 stars 108 forks source link

Ambiguous_content #465

Closed antarapaul02 closed 2 years ago

antarapaul02 commented 2 years ago

In pangolin 4.0.6, the column qc_note has been introduced into the lineage report. I wanted to clarify if the term 'Ambiguous_content' in the qc note column is as same as 'N_content'.

aineniamh commented 2 years ago

It means the amount of content that's not an A,T,C,G or - character. The reason for the shift is that this accounts for the amount of ambiguous characters relative to a complete genome. So if you had a small fragment that was 100% complete, but only 5% of the genome, you'd still have 95% ambiguous content. Hope that helps clarify!

gadepallivs commented 8 months ago

Hi, I'm using the pangolin command line ( pangolin 4.3 & pangolin-data v1.23.1). Pangolin default QC check includes length (10K bases), and maximum N content (50%). When fasta sequences with N content (50% or more) have a lineage assigned, how do we trust that lineage assignment? is there any additional info from pangolin output we should look for to get confidence on the lineage for sequences with large N contents?

AngieHinrichs commented 8 months ago

When a sequence has a lot of Ns, of course confidence in lineage assignment is lower, but it also matters where the N's are, i.e. whether the Ns are in positions that distinguish between different lineages. A sequence can still be placed in a phylogenetic tree using a subset of available positions/mutations, although when there are more Ns, there may be more equally-scoring placements within the tree, often meaning different lineages could be assigned.

The lineage_report.csv output has a final column note that indicates whether there was a unique placement in the tree or multiple equally parsimonious placements, and whether those placements were in the same lineage or different lineages. Most sequences have a unique placement and the note column has a value like this, with a lineage and "(1/1)" meaning that one out of one placement was in that lineage:

Usher placements: JN.1(1/1)

But a sequence with more Ns at positions that distinguish between related lineages might have (for example) five equally parsimonious placements, with two placements in EG.5.1, two in EG.5.1.1 and one in XBB.1.22, encoded like this:

Usher placements: EG.5.1(2/5) EG.5.1.1(2/5) XBB.1.22(1/5)

The third column, conflict, contains 0 for unique placements, and a higher value (number of placements with a lineage different from the lineage reported in the second column lineage, divided by the total number of placements).

If you really want to dig into where the Ns and lineage-distinguishing mutations are, then I suppose you could align your sequence and lineage consensus sequences from https://github.com/corneliusroemer/pango-sequences to the reference, and use a multiple alignment viewer or snipit to visualize the mutations in your sequence vs. the mutations in lineages of usher placements.

whottel commented 7 months ago

In the later case as mentioned by AngieHinrichs where there are multiple equally parsimonious placements with Usher. In the “lineage” column of the output there is still a single lineage listed. Looking at the definition of the “lineage” column said here: https://cov-lineages.org/resources/pangolin/output.html. I take it that any Usher ambiguities are resolved by checking mutations present in the query sequence against those listed here: https://github.com/cov-lineages/pango-designation/blob/master/lineage_notes.txt. Is this the correct way to think about how this is working? Let’s say based on the example above it was XBB.1.22 that was listed as the “lineage”. Would it be fair to say this determination was based on the presence of “S:486P, on 28297C branch” for this sequence?

Thanks, Wes

AngieHinrichs commented 7 months ago

Hi Wes - lineage_notes.txt does not always list all of the mutations that distinguish the lineage from its parent lineage, only the ones of most interest to the designator. For example, if a new lineage has one nucleotide mutation that causes a Spike mutation after some nucleotide mutations that don't change any amino acid change, then the Spike mutation will definitely be listed in lineage_notes.txt, while the non-coding mutations may or may not be mentioned.

Pangolin starts by aligning your input sequence to a lineage A reference sequence using minimap2, and then masking the untranslated regions at the beginning and end of the genome. Then usher uses the nucleotide substitutions in the aligned and masked sequence relative to the reference, and the N or ambiguous bases, to search in a tree of ~100,000 consensus sequences, lineageTree.pb. That tree is distilled (once per pangolin-data release) from the UCSC tree with 16 million sequences (you can search your sequences in the full tree using https://usher.bio/), by randomly selecting 50 sequences from each lineage's branch and then removing private mutations, to get a pruned-down branch that represents some of the diversity within the lineage. So even if lineage_notes.txt did list all the mutations, the search is a little more extensive than that.

When there are multiple equally parsimonious placements (EPPs), I believe usher uses the number of descendants of each node as a tie-breaker. If the EPPs are for a parent lineage and a child lineage, that will favor the parent lineage because it will have more descendant sequences than the child. Otherwise, given the way lineageTree.pb is constructed, the lineage with more sublineages (or more diverse sequences) should win.

Are you seeing any results that seem to disagree with lineage_notes.txt?

whottel commented 7 months ago

Thanks for providing that explanation of how pangolin is using usher. In your explanation of tiebreakers for EPPs you mentioned that the parent lineage would be favored. Recently I had two assemblies with EPPs (.fasta converted to .txt to link here EPP_Check.txt lineage_report.csv ). One was called as BQ.1.1.47 and the other JN.1. In both cases the parent lineage, BQ.1.1 and BA.2.86.1 respectively, were listed as an equally likely usher placement. In these cases why would the child lineage be favored?

Please forgive the tangent, but would it be fair to say that first, prior to usher, pangolin attempts lineage assignment by sequence hash similarity to a set of designated genomes per lineage resulting in the “Assigned from designation hash” note if there is a match?

Not with any recent samples, but with sequencing historical samples with two different methods we found that with sequences generated under one of the methods there tended to be a stretch of internal N’s in ORF1( hash_usher_comparison.txt ). The lineage calls for these were not by designation hash, but rather by usher as B.1.369. Their counterparts without internal N’s were called as B.1.564 by designation hash. ( lineage_report.csv ). I am not sure exactly what is the difference between these two lineages, as the lineage_notes.txt only says “split out from B.1.369”. Based on your earlier explanation there could be non-coding mutations that just are not listed there. I also went into the lineages_hash_list and grabbed 10 of the B.1.564 assemblies from GISAID and ran these through the usher web tool ( B.1.564_hash_list_subset.txt Usher_output_screenshot ). I was surprised to see that the majority of the 10 I selected were not grouped within B.1.564 by usher. In fact, it looks like more complete assemblies tended to be grouped with B.1.369. Is there an actual biological difference between B.1.564 and B.1.369 or is the distinction more of an artifact of assembly completeness?

AngieHinrichs commented 7 months ago

Thanks for sharing the sequences @whottel. I was mistaken that the larger/parental lineage should generally win -- I forgot that "BQ.1.1" might mean that the sample was placed on a small terminal branch within the larger BQ.1.1 branch, not necessarily near the root of BQ.1.1. If the sample is also placed at or near the root of BQ.1.1.47, then the BQ.1.1.47 node could have more descendants than a more terminal node within BQ.1.1. If the sample had EPPs at the root of BQ.1.1 and the root of BQ.1.1.47, then I would still expect BQ.1.1 to win. But in this case, running usher locally on the lineageTree.pb file and your sequences (aligned and masked by pangolin), now I see that the BQ.1.1 placement was on a tiny branch (BQ.1.1 > A10323G) that has only one descendant in the pruned-down lineageTree.pb, while the BQ.1.1.47 placement was at the root of BQ.1.1.47 (BQ.1.1 > A22002T) which has multiple descendants. In any of the three potential placements, the sequence had 20 private mutations, which is a lot more than the difference between BQ.1.1 and BQ.1.1.47. It's just hard to place that sequence confidently.

would it be fair to say that first, prior to usher, pangolin attempts lineage assignment by sequence hash similarity to a set of designated genomes per lineage resulting in the “Assigned from designation hash” note if there is a match?

Yes. (The hash is computed on the aligned and masked sequence, so differences in the masked untranslated regions 1-264 and 29675-29903 are ignored.) If a match is found for the hash, then pangolin uses the designated lineage and does not run usher on the sequence; lineage_report.csv will have PANGO-v1.25.1 instead of PUSHER-1.25.1 in the version (9th) column. You can disable that behavior with the option --skip-designation-cache.

Ah, B.1.369 and B.1.564 -- most of their designated sequences land on the same branch of the UShER tree, and our annotation algorithm assigned B.1.369 to that branch and B.1.564 to its parent node. Lineages designated before mid-2021 used alternate tree building methods and sometimes resulted in designations that are non-monophyletic and/or overlapping in the UShER tree. Also, before mid-2021, the Pango team in general avoided listing specific mutations for lineages. I will send a follow-up message with more about B.1.369 and B.1.564.

AngieHinrichs commented 7 months ago

Here is a taxonium view of the B.1.564 / B.1.369 branch of the UCSC UShER tree, with red circles around 50 randomly sampled designated sequences for B.1.369:

and here is the same branch, with red circles around 50 randomly sampled designated sequences for B.1.564:

As you can see, the designated sequences for both lineages are spread across pretty much the same branch, although maybe focused on different child branches within that branch. Both lineages have sublineages within that branch.

It would take more time to dig into the particulars of what happened there (i.e. why the lineages looked distinct on the earlier Pango trees but not on the UShER tree), but one thing that I've seen cause similar troubles in the past is untrimmed nanopore adapters causing artifactual "mutations" that are masked out when building the UShER tree (as part of the Problematic Sites set) but were not masked out in the earlier Pango trees. That's not necessarily what happened here, just something I've seen before.

whottel commented 7 months ago

Thanks so much for looking into this and providing that great explanation!