Closed ArtPoon closed 2 years ago
mm2 = minimap2_paired(args.fq1, args.fq2, ref=args.ref, nthread=args.thread, path=args.path)
for pairs in matchmaker(mm2):
for qname, rpos, cigar, seq in pairs:
diffs = encode_diffs(qname, rpos, cigar, seq)
print(diffs)
yields:
art@Wernstrom wastewater % python3 scripts/minimap2.py data/SRR15646856_1.fastq data/SRR15646856_2.fastq
('SRR15646856.7', [], [(0, 18913), (19014, 29903)])
('SRR15646856.7', [('~', 19159, 'T')], [(0, 19119), (19220, 29903)])
Tuples comprise qname
, diffs
and missing
:
qname
is the read labeldiffs
is a list of differences from the reference genome (substitutions or indels)missing
is a list of tuples spanning regions that are not covered by the read.We need to:
missing
lists to identify the overlap between the paired reads, if any.When we transition to working with amplicon-based data, we will have to adapt the above to store linkages among differences.
Working with tiled amplicon data makes tracking coverage a lot simpler, since reads can only map to one of a relatively small number of locations in the genome.
Use SRR14811414 from the Landgraff paper
art@Wernstrom wastewater % python3 scripts/minimap2.py data/21-WW-BAR-01-Illumina_R1_001.fastq.gz.1 data/21-WW-BAR-01-Illumina_R2_001.fastq.gz.1
('M04937:270:000000000-JFBNC:1:1101:15961:1337', [], [(0, 4026), (4201, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15961:1337', [], [(0, 4026), (4201, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15491:1362', [('~', 28271, 'G')], [(0, 28260), (28511, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15491:1362', [('~', 28271, 'G')], [(0, 28233), (28484, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15675:1395', [('~', 28271, 'G'), ('~', 28404, 'C')], [(0, 28241), (28492, 29903)])
('M04937:270:000000000-JFBNC:1:1101:15675:1395', [('~', 28269, 'C'), ('~', 28271, 'G')], [(0, 28264), (28515, 29903)])
Added back seq_utils.SC2Locator
to look up amino acid substitutions and indels.
Clarification of expected results:
diff = [
('A.1', [('~', 100, 'C')], [(0, 50), (200, 29903)]),
('A.1', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (200, 29903)])
]
merge_diffs(diff)
('A.1', [('~', 100, 'C')], [(0, 50), (200, 29903)])
diff = [
('A.2', [('~', 100, 'C')], [(0, 60), (200, 29903)]),
('A.2', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (180, 29903)])
]
merge_diffs(diff)
('A.2', [('~', 100, 'C')], [(0, 60), (180, 29903)])
Alternatively, we could express the coverage contribution with weights, e.g. [(0, 50), (51, 60), (61, 180), (181, 200), (201, 29903)] in the second situation, where we assume that the first entry has a weight of 0, second is 0.5, third is 1 (full coverage), fourth is 0.5, and fifth is 0.
diff = [ # same as before
('A.2', [('~', 100, 'C')], [(0, 60), (200, 29903)]),
('A.2', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (180, 29903)])
]
merge_diffs(diff)
('A.2', [('~', 100, 'C'), (~', 53, 'A')], [(0, 50), (200, 29903)])
diff = [
('A.3', [('~', 90, 'C')], [(0, 60), (100, 29903)]),
('A.3', [('~', 110, 'C'), (~', 153, 'A')], [(0, 105), (180, 29903)])
]
merge_diffs(diff)
('A.3', [('~', 90, 'C')], [(0, 60), (100, 29903)])
('A.3', [('~', 110, 'C'), (~', 153, 'A')], [(0, 105), (180, 29903)])
diff = [
('A.3', [('~', 90, 'C')], [(0, 60), (100, 29903)]),
('A.3', [('~', 110, 'C'), (~', 153, 'A')], [(0, 105), (180, 29903)])
]
merge_diffs(diff)
('A.3', [('~', 90, 'C'), ('~', 110, 'C'), (~', 153, 'A')], [(0, 60), (101, 104), (180, 29903)])
In this situation, the non-overlapping pairs are treated as if they're separate except they only add 1 to the denominator.
For paired reads that have a mutation and a censored region, we can basically either assume that the reads would have disagreed (option 1s) or agreed (option 2s).
As a secret third option, we could choose each of the options with 50% chance since we have no information either way.
I added a new function
minimap2_paired
tominimap.py
that uses the-ax sr
option to take two input files (i.e., R1 and R2).