cbg-ethz / cojac

GNU General Public License v3.0
18 stars 5 forks source link

issue with overlapping amplicons #7

Open hudenise2 opened 2 years ago

hudenise2 commented 2 years ago

I run cojac on wastewater samples which are sequenced using NimagenV3 primer amplicons. It worked well for most variant profile however I encountered this error while using a profile which has co-occurrence in amplicon 120: File "/usr/local/bin/cooc-mutbamscan2", line 245, in table[sample]=scanbam(alnfname, amplicons) File "/usr/local/bin/cooc-mutbamscan2", line 155, in scanbam amplicon_iter = alnfile.fetch(rq_chr, rq_b, rq_e) File "pysam/libcalignmentfile.pyx", line 1091, in pysam.libcalignmentfile.AlignmentFile.fetch File "pysam/libchtslib.pyx", line 688, in pysam.libchtslib.HTSFile.parse_region ValueError: invalid coordinates: start (22867) > stop (22846)

and here are the corresponding NimagenV3 insert coordinates: amplicon_119 [22595-22837] amplicon_120 [22804-22976] amplicon_121 [22876-23119]
amplicon_122 [23072-23288] According to the cooc-mutbamscan script, the boundaries are defined by +30 from previous amplicon end (22867) and -30 from next amplicon start (22846) which lead to this error. Could you implement a fix for it? Irene Bassano with you had contact with about cojac used to work with me. Thanks Hubert

ggrimes commented 2 years ago

I am also having the same error using the subArtic primers set.

MN908947.3 23193 23400 17 1 + MN908947.3 23290 23463 18 2 + MN908947.3 23408 23579 19 1 +

amplicon_18_EU  23430   23378   {23403: 'G'}    Traceback (most recent call last):
  File "/gpfs/igmmfs01/eddie/COVID-WW/scratch/cojac/cojac/./cooc-mutbamscan2", line 323, in <module>
    table[sample]=scanbam(alnfname, amplicons,rq_chr)
  File "/gpfs/igmmfs01/eddie/COVID-WW/scratch/cojac/cojac/./cooc-mutbamscan2", line 168, in scanbam
    amplicon_iter = alnfile.fetch(rq_chr, rq_b, rq_e)
  File "pysam/libcalignmentfile.pyx", line 1082, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 688, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid coordinates: start (23430) > stop (23378)

If I change the cooc-mutbamscan (dev branch) script by removing the -30 +30 it runs, but I assume this would lead to incorrect results.

# make query start and query end for each amplicon
amp_bed["qstart"] = pd.concat([amp_bed["start"][0:1], amp_bed["stop"][:-1]]).reset_index(drop=True) + 0
amp_bed["qstop"] = pd.concat([amp_bed["start"][1:], amp_bed["stop"][-1:]]).reset_index(drop=True) - 0
ggrimes commented 2 years ago

the latest patch 247ab05 seems to fix the issue

DrYak commented 2 years ago

Yup, it looks like this commit manager to fix it.


A bit of background about this optimization/hack

In samtools' libhts and in pysam if you request between the start and end of a window, the library will return all the read-pairs in that partially overlap that window. In our case, if we submit a request spanning the whole amplicon's window, we will get all the read-pairs of that amplicon, BUT because the multiplex PCR are tiled, we would get also the read-pairs from neighbouring amplicons.

e.g.: (where b: some base and p: base of a primer)

bbbbbbbbbbppp----------------ppppbbbbbbbbb : amp 119 and 121
----pppbbbbbbbbbbbbbbbbbbbbbbbbbbbbppp---- : amp 120

       |--------------------------| : request

With such a request we would get all the (whole) reads-pairs of all 3 amplicons. We would need to then check for each read received if it actually covers the mutation position.

Whereas using the neighbouring amplicon start/stop and skipping -30/+30 to skip the primers:

bbbbbbbbbbppp----------------ppppbbbbbbbbb : amp 119 and 121
----pppbbbbbbbbbbbbbbbbbbbbbbbbbbbbppp---- : amp 120
         | + 30            30 - | : skip
               |---------| : request

The only read-pairs that overlap with the request are the ones from amplicon 120 We will get only (whole) read-pairs of that amplicon, so we don't need to filter-out reads that absolutely don't cover the positions.

But in Nimagen, the amplicon are so short that our usual -30/+30 overshoots and crosses. We fixed it (doing +5/-5 from the centre in that case).

bbbbbbbbbbppp----------------ppppbbbbbbbbb : amp 119 and 121
----pppbbbbbbbbbbbbbbbbbbbbbbbbbbbbppp---- : amp 120
                5 - | + 5  : skip
               |---------| : request

Caveat regarding amplicon lenght

source: doi:10.1101/2021.05.26.21257861

Other groups have shown that the optimal amplicon length for wastewater seem to be around ~400bp as done by, e.g., the Artic protocols. With the much shorter amplicons like Nimagen, you might have less opportunities for cooccurrences of mutations. (and with much longer amplicons, there might be fewer fragments as large that manage to survive through the wastewater all the way to the sequencing)

ggrimes commented 2 years ago

Thanks that makes it clear. Unfortunately, my amplicons are very short and this centering method would still retrieve PE reads that overlap multiple amplicons. Maybe I can create a custom insert bed file that has nonoverlapping amplicon regions. Or reduce the window to a single base.

DrYak commented 2 years ago

From the theory point of view:

BUT! in practice:

Since commit d505a5d31e643170fef9abcec42550c415980b22 before Christmas, there are new options to get this requests in a YAML file:

> ./cooc-mutbamscan --help
usage: cooc-mutbamscan [-h] [-s TSV | -a BAM/CRAM [BAM/CRAM ...]] [-/ [BATCHNAME]] [-p PATH] [-r REFID] [-m DIR] [-b BED] [-# COOC] [-Q YAML | -A YAML] [-j JSON] [-y YAML] [-t TSV] [-d]

scan amplicon (covered by long read pairs) for mutation cooccurrence

optional arguments:
  -h, --help            show this help message and exit
…
  -Q YAML, --in-amp YAML, --amplicons YAML
                        use the supplied YAML file to query amplicons instead of building it from BED + voc's DIR
  -A YAML, --out-amp YAML, --out-amplicons YAML
                        output amplicon query in a YAML file

use the --out-amplicons option and look at the 5th column in this file: it's a map (a dictionary) of all mutation visible on that position.

For example with Artic V4 and Omicron, this is what we get:

75_om: [22428, 22785, 22504, 22647, {22578: A, 22673: C, 22674: T, 22679: C}]
76_om: [22677, 23028, 22815, 22944, {22679: C, 22813: T, 22882: G, 22898: A, 22992: A,
    23013: C}]
77_om: [22974, 23327, 23058, 23216, {22992: A, 23013: C, 23040: G, 23048: A, 23055: G,
    23202: A}]
78_om: [23246, 23611, 23357, 23545, {23525: T, 23599: G}]
88_om: [26277, 26635, 26368, 26557, {26530: G, 26577: G}]

Two amplicons (78, 88) carry two mutation, the three other (75, 76, 77) carry many more.

Another separate thing to watch out for: beware that your amplicon's primer don't happen to map to a heavily mutated or a deleted part of your variant of interest, otherwise you're going to have drop out. (e.g.: the primer that correspond to Artic V3's amplicons 75 and 76 have trouble binding to omicron and thus we lose information there).