NVIDIA-Genomics-Research / GenomeWorks

SDK for GPU accelerated genome assembly and analysis
https://clara-parabricks.github.io/GenomeWorks/
Apache License 2.0

[pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files #571

Open edawson opened 4 years ago

edawson commented 4 years ago

Despite recent updates to the evaluate_paf script to handle queries better, it is still too slow for large-scale CI jobs.

One solution is to drop the interval tree data structure and rely on sorted PAF input instead. Processing large PAF files may still take a significant amount of time, but memory usage should drop substantially: only two PAF records would need to be held in memory at a time, whereas all truth set records are currently kept in memory.
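To make the sorted-input idea concrete, here is a minimal sketch of a streaming merge over two PAF files. It is not the current evaluate_paf logic. It assumes both files are already sorted by query name and then target name in plain byte order (for example with `LC_ALL=C sort`), and it matches records on the (query name, target name) pair only, ignoring the coordinate checks a real evaluation would need. The names `paf_keys` and `compare_sorted_paf` are hypothetical.

```python
def paf_keys(handle):
    """Yield (query_name, target_name) for each PAF record in a text stream."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank lines or lines missing the 12 mandatory PAF columns
        # Column 1 is the query name, column 6 is the target name.
        yield fields[0], fields[5]


def compare_sorted_paf(truth_handle, test_handle):
    """Merge two sorted PAF streams and return (true_pos, false_pos, false_neg).

    Assumption: both streams are sorted by (query name, target name) in the
    same order Python uses to compare strings. Only one record from each
    stream is held in memory at any time.
    """
    truth = paf_keys(truth_handle)
    test = paf_keys(test_handle)
    truth_rec = next(truth, None)
    test_rec = next(test, None)
    true_pos = false_pos = false_neg = 0

    while truth_rec is not None and test_rec is not None:
        if truth_rec == test_rec:
            # Same overlap in both files: count it and advance both streams.
            true_pos += 1
            truth_rec = next(truth, None)
            test_rec = next(test, None)
        elif truth_rec < test_rec:
            # Truth record has no counterpart in the test stream.
            false_neg += 1
            truth_rec = next(truth, None)
        else:
            # Test record has no counterpart in the truth stream.
            false_pos += 1
            test_rec = next(test, None)

    # Whatever remains in either stream is unmatched.
    while truth_rec is not None:
        false_neg += 1
        truth_rec = next(truth, None)
    while test_rec is not None:
        false_pos += 1
        test_rec = next(test, None)

    return true_pos, false_pos, false_neg
```

Usage would look like `compare_sorted_paf(open("truth.sorted.paf"), open("test.sorted.paf"))`, where both file names are placeholders. Runtime stays linear in the number of records, so very large files will still take a while, but memory no longer grows with the size of the truth set.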

Another option would be to provide random access to bgzipped PAF files, either through tabix or some other API.