ComparativeGenomicsToolkit / hal2vg

Convert HAL to VG
MIT License

[WIP] Pinch across SNPs #27

Closed by glennhickey 4 years ago

glennhickey commented 4 years ago

Currently, hal2vg only pinches exact matches along branches. This is a problem, particularly for star trees: homologies between sibling genomes are missed in the presence of SNPs, i.e. when the siblings have a different base than the ancestor.

This PR adds a patch for that case: when a SNP is found with respect to the parent, a column iterator is used to find exact homologous matches further away in the tree. This should catch all missing homologies, but I'm concerned about speed and memory.
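To make the idea concrete, here is a minimal toy sketch in Python. It is not the hal2vg implementation and does not use the HAL API. An alignment column is modelled as a plain dict from genome name to base, and the names `pinch_column`, `column`, and `parent_of` are hypothetical. It only illustrates the two cases: an exact match along a branch, and a SNP that falls back to a search across the whole column.

```python
from collections import defaultdict

def pinch_column(column, parent_of):
    """Toy model of one alignment column (hypothetical, not the HAL API).

    column:    dict mapping genome name -> base at this column
    parent_of: dict mapping genome name -> parent genome name
    Returns (branch_pinches, snp_pinches) as sets of genome pairs.
    """
    # Index the column by base so SNP lookups can scan all genomes.
    by_base = defaultdict(list)
    for genome in sorted(column):
        by_base[column[genome].upper()].append(genome)

    branch_pinches, snp_pinches = set(), set()
    for genome in sorted(column):
        parent = parent_of.get(genome)
        if parent is None or parent not in column:
            continue  # root, or parent not aligned in this column
        base = column[genome].upper()
        if column[parent].upper() == base:
            # Exact match along the branch: the pre-existing behaviour.
            branch_pinches.add((parent, genome))
            continue
        # SNP with respect to the parent: look elsewhere in the column
        # for a genome carrying the same base and pinch to it instead.
        matches = [g for g in by_base[base] if g != genome]
        if matches:
            snp_pinches.add(tuple(sorted((genome, matches[0]))))
    return branch_pinches, snp_pinches

# Star tree: every sample hangs directly off the ancestor.
column = {"Anc": "A", "hg38": "A", "s1": "G", "s2": "G", "s3": "T"}
parent_of = {g: "Anc" for g in ("hg38", "s1", "s2", "s3")}
branch, snp = pinch_column(column, parent_of)
print(branch, snp)
```

In this example `s1` and `s2` share a `G` that differs from the ancestor's `A`, so a branch-only pass would leave them unpinched. The column-wide lookup is what recovers that sibling homology, and it is also where the extra time goes.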

Resolves #26

glennhickey commented 4 years ago

Slower, but not disastrously so: about 28% more wall-clock time on chr20, with peak memory up about 2%.


Command being timed: "hal2vg lc2019_12ont-hg38.cactus.minimap2_star-all-to-ref-fatanc-no-secondary-july-8.hal --progress --inMemory --onlySequenceNames"

Original:
    User time (seconds): 158.03
    System time (seconds): 2.34
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:40.47
    Maximum resident set size (kbytes): 3973656

Fix:
    User time (seconds): 201.93
    System time (seconds): 2.73
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:25.34
    Maximum resident set size (kbytes): 4049860
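For reference, the relative overhead can be worked out from the two timing reports above. The following is a small Python sketch. The helper names `wall_seconds` and `pct_change` are made up for illustration, and the numbers are copied by hand from the output rather than parsed from it.

```python
def wall_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss' wall-clock string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

# Values copied from the two timing reports.
original = {"user_s": 158.03, "wall_s": wall_seconds("2:40.47"), "rss_kb": 3973656}
fix = {"user_s": 201.93, "wall_s": wall_seconds("3:25.34"), "rss_kb": 4049860}

for key in original:
    print(f"{key:>7}: {pct_change(original[key], fix[key]):+.1f}%")
```

Both user time and wall-clock time come out a little under 30% higher, while maximum resident set size grows by roughly 2%.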

Original stats:
    nodes  5859015
    edges  10307039
    length 80506059

Fix stats:
    nodes  2912990
    edges  4474872
    length 76935653
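The effect on graph size can be summarised the same way. This Python sketch assumes that `length` is the total number of bases summed over all nodes (an assumption about how these stats were produced). The helpers `reduction_pct` and `mean_node_length` are illustrative names only.

```python
# Graph stats copied from the two reports above.
original = {"nodes": 5859015, "edges": 10307039, "length": 80506059}
fix = {"nodes": 2912990, "edges": 4474872, "length": 76935653}

def reduction_pct(before, after):
    """How much smaller 'after' is than 'before', in percent."""
    return 100.0 * (before - after) / before

def mean_node_length(stats):
    """Average bases per node, assuming 'length' is total node sequence."""
    return stats["length"] / stats["nodes"]

for key in original:
    print(f"{key:>6}: {reduction_pct(original[key], fix[key]):.1f}% smaller")

print(f"mean node length: {mean_node_length(original):.1f} bp -> "
      f"{mean_node_length(fix):.1f} bp")
```

The fix roughly halves the node count and removes more than half of the edges, while total length drops by only a few percent. Under the stated assumption, the average node roughly doubles in length, which fits sequence that was previously duplicated across siblings now being merged.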