Closed esteinig closed 1 year ago
Actually this is fine as is, since we are iterating using a standard BufReader
, parsing each line into PafRecord
. What provides a memory buff is not storing the records in PafAlignment
as they are actually never used later except to compute intervals. Instead, intervals are computed from records on the fly while parsing the alignment file.
This provides a slight speedup and ~ 60% reduction in peak memory use for the 'large' test alignment from a deep sequencing run which consists of 83 million reads with 63 million alignments against a 70k reference sequence database (~ 8GB PAF). Vircov
completes in ~ 55 seconds using 3GB RAM for this test case.
On a smaller test alignment from a similar run that produced 1.7 million alignments against 70k reference sequences, vircov
completes in < 1 second with 12MB peak memory usage (~ 215MB PAF).
--> Restructure PAF filter tests
Maybe if the PafRecord
is minimised to only the fields needed this will parsing speed?
PafMiniRecord
provides a small speedup: 0.85s --> 0.78s (small test dataset)
Still fairly slow on large alignments, checking out csv_reader
branch to see if standard (optimized) reader crate can be used instead.
rust-csv
with serde
deserialization is actually slightly slower (< 1 second) on the large alignment but slightly faster (0.58 vs 0.72 seconds) than a custom from_str
method.
I think implementing the csv reader may be better from a error handling / design point of view
Using byte arrays instead of Strings might be better for memory use
Edit: can't deserialize into borrowed byte arrays, will revisit at a later stage
I think I need to reimplement a custom PAF reader as the csv
crate expects the same number of columns across the whole file. However when using -c
option in minimap2
a variable number of CIGAR tags can be output e.g.
NB551233:336:HGVJ5AFX3:1:11102:13098:1580 149 0 128 - acc|GENBANK|AF411058.1|Homo 181654 45399 45524 111 128 1 NM:i:17 ms:i:104 AS:i:100 nn:i:0 tp:A:P cm:i:1 s1:i:41 s2:i:0 de:f:0.1190 rl:i:0 cg:Z:42M3I83M
and ** marked tag zd:i:1
NB551233:336:HGVJ5AFX3:1:11102:13098:1580 149 0 34 + acc|GENBANK|AF411058.1|Homo 181654 135720 135754 32 34 6 NM:i:2 ms:i:56 AS:i:56 nn:i:0 tp:A:P cm:i:1 s1:i:14 s2:i:0 de:f:0.0588 **zd:i:1** rl:i:0 cg:Z:34M
Good enough for now
Need to implement an iterator based PAF parser and link it to interval extraction directly in the
PafRecord
structs - large PAF files could probably be parsed faster this way.