Even after the update to how the evaluate_paf script handles queries, its performance is still inadequate for large-scale CI jobs.
One solution is to drop the interval tree data structure and instead rely on sorted PAF input. For large PAF files this may still take a significant amount of time, but it should greatly reduce memory usage: only two PAF records would need to be kept in memory at a time, whereas currently all truth set records are held in memory.
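As a rough illustration of the sorted-input idea, here is a minimal sketch of a streaming merge over two PAF inputs. It is not the existing evaluate_paf code. It assumes both files are sorted by query name (column 1) in plain byte order, for example with `LC_ALL=C sort -k1,1`, that each read appears at most once per file, and that a prediction counts as correct when it overlaps the truth interval on the same target. The names `PafRecord`, `parse_paf`, `merge_sorted`, and `evaluate` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class PafRecord(NamedTuple):
    query_name: str
    target_name: str
    target_start: int
    target_end: int


def parse_paf(lines: Iterable[str]) -> Iterator[PafRecord]:
    """Lazily yield one record per PAF line (columns 1, 6, 8, 9)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns; skip anything shorter
            continue
        yield PafRecord(fields[0], fields[5], int(fields[7]), int(fields[8]))


def overlaps(a: PafRecord, b: PafRecord) -> bool:
    """Assumed correctness criterion: same target and intersecting intervals."""
    return (a.target_name == b.target_name
            and a.target_start < b.target_end
            and b.target_start < a.target_end)


def merge_sorted(
    truth: Iterator[PafRecord], query: Iterator[PafRecord]
) -> Iterator[Tuple[Optional[PafRecord], Optional[PafRecord]]]:
    """Walk both sorted streams in lockstep, holding one record from each."""
    t = next(truth, None)
    q = next(query, None)
    while t is not None or q is not None:
        if q is None or (t is not None and t.query_name < q.query_name):
            yield t, None  # read present only in the truth set
            t = next(truth, None)
        elif t is None or q.query_name < t.query_name:
            yield None, q  # read present only in the query set
            q = next(query, None)
        else:
            yield t, q  # same read in both files
            t = next(truth, None)
            q = next(query, None)


def evaluate(truth_lines: Iterable[str], query_lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (correct, wrong, missing) counts relative to the truth set."""
    correct = wrong = missing = 0
    for t, q in merge_sorted(parse_paf(truth_lines), parse_paf(query_lines)):
        if t is None:
            continue  # query read with no truth record; ignored in this sketch
        if q is None:
            missing += 1
        elif overlaps(t, q):
            correct += 1
        else:
            wrong += 1
    return correct, wrong, missing
```

Because `evaluate` accepts any iterable of lines, it can be passed open file handles directly (for example `evaluate(open(truth_path), open(query_path))`), so neither file is ever loaded in full. Reads with multiple alignments would need the merge step extended to group consecutive records that share a query name.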
Another option would be to provide random access to bgzipped PAF files, either through TABIX or some other API.