esteinig / vircov

Viral genome coverage evaluation for metagenomic diagnostics :drop_of_blood:
Apache License 2.0
27 stars 4 forks source link

Performance optimisation [PAF] #8

Closed esteinig closed 1 year ago

esteinig commented 2 years ago

Need to implement an iterator based PAF parser and link it to interval extraction directly in the PafRecord structs - large PAF files could probably be parsed faster this way.

esteinig commented 2 years ago

Actually this is fine as is, since we are iterating using a standard BufReader, parsing each line into PafRecord. What provides a memory buff is not storing the records in PafAlignment as they are actually never used later except to compute intervals. Instead, intervals are computed from records on the fly while parsing the alignment file.

This provides a slight speedup and ~ 60% reduction in peak memory use for the 'large' test alignment from a deep sequencing run which consists of 83 million reads with 63 million alignments against a 70k reference sequence database (~ 8GB PAF). Vircov completes in ~ 55 seconds using 3GB RAM for this test case.

esteinig commented 2 years ago

On a smaller test alignment from a similar run that produced 1.7 million alignments against 70k reference sequences, vircov completes in < 1 second with 12MB peak memory usage (~ 215MB PAF).

esteinig commented 2 years ago

--> Restructure PAF filter tests

esteinig commented 2 years ago

Maybe if the PafRecord is minimised to only the fields needed this will parsing speed?

esteinig commented 2 years ago

PafMiniRecord provides a small speedup: 0.85s --> 0.78s (small test dataset)

esteinig commented 2 years ago

Still fairly slow on large alignments, checking out csv_reader branch to see if standard (optimized) reader crate can be used instead.

esteinig commented 2 years ago

rust-csv with serde deserialization is actually slightly slower (< 1 second) on the large alignment but slightly faster (0.58 vs 0.72 seconds) than a custom from_str method.

I think implementing the csv reader may be better from a error handling / design point of view

esteinig commented 2 years ago

Using byte arrays instead of Strings might be better for memory use

Edit: can't deserialize into borrowed byte arrays, will revisit at a later stage

esteinig commented 2 years ago

I think I need to reimplement a custom PAF reader as the csv crate expects the same number of columns across the whole file. However when using -c option in minimap2 a variable number of CIGAR tags can be output e.g.

NB551233:336:HGVJ5AFX3:1:11102:13098:1580   149 0   128 -   acc|GENBANK|AF411058.1|Homo 181654  45399   45524   111 128 1   NM:i:17 ms:i:104    AS:i:100    nn:i:0  tp:A:P  cm:i:1  s1:i:41 s2:i:0  de:f:0.1190 rl:i:0  cg:Z:42M3I83M

and ** marked tag zd:i:1

NB551233:336:HGVJ5AFX3:1:11102:13098:1580   149 0   34  +   acc|GENBANK|AF411058.1|Homo 181654  135720  135754  32  34  6   NM:i:2  ms:i:56 AS:i:56 nn:i:0  tp:A:P  cm:i:1  s1:i:14 s2:i:0  de:f:0.0588 **zd:i:1**  rl:i:0  cg:Z:34M
esteinig commented 1 year ago

Good enough for now