Adapt minimap2.py script to process paired-end read data

ArtPoon commented 2 years ago

I added a new function minimap2_paired to minimap.py that uses the -ax sr option to take two input files (i.e., R1 and R2).

ArtPoon commented 2 years ago

    mm2 = minimap2_paired(args.fq1, args.fq2, ref=args.ref, nthread=args.thread, path=args.path)
    for pairs in matchmaker(mm2):
        for qname, rpos, cigar, seq in pairs:
            diffs = encode_diffs(qname, rpos, cigar, seq)
            print(diffs)

yields:

art@Wernstrom wastewater % python3 scripts/minimap2.py data/SRR15646856_1.fastq data/SRR15646856_2.fastq 
('SRR15646856.7', [], [(0, 18913), (19014, 29903)])
('SRR15646856.7', [('~', 19159, 'T')], [(0, 19119), (19220, 29903)])

ArtPoon commented 2 years ago

Tuples comprise qname, diffs and missing:

qname is the read label
diffs is a list of differences from the reference genome (substitutions or indels)
missing is a list of tuples spanning regions that are not covered by the read.

We need to:

[ ] write a function that takes the two missing lists to identify the overlap between the paired reads, if any.
[ ] identify differences within the overlapping region that are discordant between reads, i.e., present in one but not the other. These differences should be excluded.
[ ] track the frequencies (counts) of all differences in merged reads (numerator)
[ ] track genome positions that are covered by the merged read pair to calculate overall coverage (number of merged reads per position) (denominator).

ArtPoon commented 2 years ago

When we transition to working with amplicon-based data, we will have to adapt the above to store linkages among differences.

ArtPoon commented 2 years ago

Working with tiled amplicon data makes tracking coverage a lot simpler, since reads can only map to one of a relatively small number of locations in the genome.

ArtPoon commented 2 years ago

Use SRR14811414 from the Landgraff paper

ArtPoon commented 2 years ago

art@Wernstrom wastewater % python3 scripts/minimap2.py data/21-WW-BAR-01-Illumina_R1_001.fastq.gz.1 data/21-WW-BAR-01-Illumina_R2_001.fastq.gz.1
('M04937:270:000000000-JFBNC:1:1101:15961:1337', [], [(0, 4026), (4201, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15961:1337', [], [(0, 4026), (4201, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15491:1362', [('~', 28271, 'G')], [(0, 28260), (28511, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15491:1362', [('~', 28271, 'G')], [(0, 28233), (28484, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15675:1395', [('~', 28271, 'G'), ('~', 28404, 'C')], [(0, 28241), (28492, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15675:1395', [('~', 28269, 'C'), ('~', 28271, 'G')], [(0, 28264), (28515, 29903)])

ArtPoon commented 2 years ago

Added back seq_utils.SC2Locator to look up amino acid substitutions and indels.

DBecker7 commented 2 years ago

Clarification of expected results:

Perfect overlap

Only return mutation if it's in both tuples.

diff = [
    ('A.1', [('~', 100, 'C')], [(0, 50), (200, 29903)]),
    ('A.1', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (200, 29903)])
]

merge_diffs(diff) 
('A.1', [('~', 100, 'C')], [(0, 50), (200, 29903)])

Overlapping (not perfect)

Option 1: Only return if in both, coverage is the intersection.

diff = [
    ('A.2', [('~', 100, 'C')], [(0, 60), (200, 29903)]),
    ('A.2', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (180, 29903)])
]

merge_diffs(diff) 
('A.2', [('~', 100, 'C')], [(0, 60), (180, 29903)])

Alternatively, we could express the coverage contribution with weights, e.g. [(0, 50), (51, 60), (61, 180), (181, 200), (201, 29903)] in the second situation, where we assume that the first entry has a weight of 0, second is 0.5, third is 1 (full coverage), fourth is 0.5, and fifth is 0.

Option 2: Return if it's in both reads or in one read and censored in the other, coverage is the union.

diff = [ # same as before
    ('A.2', [('~', 100, 'C')], [(0, 60), (200, 29903)]),
    ('A.2', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (180, 29903)])
]

merge_diffs(diff) 
('A.2', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (200, 29903)])

Non-overlapping

Option 1: Treat them as separate reads

diff = [ 
    ('A.3', [('~', 90, 'C')], [(0, 60), (100, 29903)]),
    ('A.3', [('~', 110, 'C'), (~', 153, 'A')], [(0, 105), (180, 29903)])
]

merge_diffs(diff) 
('A.3', [('~', 90, 'C')], [(0, 60), (100, 29903)])
('A.3', [('~', 110, 'C'), (~', 153, 'A')], [(0, 105), (180, 29903)])

Option 2: Return all mutations and the (discontinuous) union of coverage?

diff = [ 
    ('A.3', [('~', 90, 'C')], [(0, 60), (100, 29903)]),
    ('A.3', [('~', 110, 'C'), (~', 153, 'A')], [(0, 105), (180, 29903)])
]

merge_diffs(diff) 
('A.3', [('~', 90, 'C'), ('~', 110, 'C'), (~', 153, 'A')], [(0, 60), (101, 104), (180, 29903)])

In this situation, the non-overlapping pairs are treated as if they're separate except they only add 1 to the denominator.

For paired reads that have a mutation and a censored region, we can basically either assume that the reads would have disagreed (option 1s) or agreed (option 2s).

As a secret third option, we could choose each of the options with 50% chance since we have no information either way.

ArtPoon commented 2 years ago

overlapping (not perfect) - Option 2: Return if it's in both reads or in one read and censored in the other, coverage is the union
non-overlapping - Option 2: Return all mutations and the (discontinuous) union of coverage

PoonLab / gromstole