YuLab-SMU / treeio

:seedling: Base Classes and Functions for Phylogenetic Tree Input and Output

speed up read.beast #118

Closed SimonGreenhill closed 9 months ago

SimonGreenhill commented 9 months ago

Description

read.beast is very slow. This PR speeds it up by using Perl-compatible regular expressions (PCRE). In a benchmark, this cuts run time by about 79%:

label           min     median    mem_alloc  total_time  n_itr
original (OLD)  17.7s      18s    60.3MB      3.01m    10
pcre (NEW)      3.71s    3.77s    60.3MB     37.71s    10
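As a minimal sketch of the kind of change involved (not the PR's actual diff), base R's regex functions switch from the default engine to PCRE when called with `perl = TRUE`. The label strings and the pattern below are made up for illustration and are not taken from treeio.

```r
# Hypothetical BEAST-style annotated labels (illustrative only, not from the PR).
x <- rep("taxonA[&rate=0.12,height=1.5]:0.3", 1000)

# Strip the "[&...]" annotation block. "[^]]" matches any character except "]".
pattern <- "\\[&[^]]*\\]"

# Same pattern, two regex engines: the default one, and PCRE via perl = TRUE.
default_out <- gsub(pattern, "", x)
pcre_out    <- gsub(pattern, "", x, perl = TRUE)

# Only the engine differs; the results should be identical.
stopifnot(identical(default_out, pcre_out))
pcre_out[1]
```

Whether PCRE is faster depends on the pattern and the input, so a change like this is worth benchmarking on real tree files, as was done above.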

I've also added a more complete test case to make sure the beast trees are loaded correctly, and the relevant tree annotations are loaded properly.

Related Issue

#116 & #117 are merged here as I was finding the test warnings annoying.

SimonGreenhill commented 9 months ago

Checking whether the tree file is translated is expensive, so ff319a2 now does this only once, which gains another 4% decrease in run time.
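The general idea of doing a file-level check once and reusing the result can be sketched as follows. This is an illustration under stated assumptions, not the code in ff319a2: `has_translate_block` is a hypothetical helper, and the NEXUS lines are invented.

```r
# Hypothetical helper: does the file contain a NEXUS "Translate" block?
has_translate_block <- function(lines) {
  any(grepl("^\\s*translate\\b", lines, ignore.case = TRUE, perl = TRUE))
}

# Invented example file contents.
lines <- c("#NEXUS", "Begin trees;", "  Translate",
           "    1 taxonA,", "    2 taxonB;",
           "tree t1 = (1:0.1,2:0.2);",
           "tree t2 = (1:0.3,2:0.1);",
           "End;")

tree_lines <- grep("^tree ", lines, value = TRUE)

# Evaluate the check once, outside the per-tree loop, and reuse the result.
translated <- has_translate_block(lines)
results <- vapply(tree_lines, function(tr) {
  if (translated) "needs label lookup" else "labels are literal"
}, character(1), USE.NAMES = FALSE)
```

The saving comes from scanning the file a single time instead of once per tree.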

SimonGreenhill commented 9 months ago

This is now an 82% decrease in run time.

  label          min   median mem_alloc total_time n_itr
  <chr>     <bch:tm> <bch:tm> <bch:byt>   <bch:tm> <int>
1 original     17.7s      18s    60.3MB      3.01m    10
2 update          3s    3.15s    56.5MB     31.38s    10
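For reference, the percentages quoted in this thread can be reproduced from the median timings in the two tables. The helper name `pct_decrease` is just for illustration.

```r
# Percentage decrease in run time, going from `old` seconds to `new` seconds.
pct_decrease <- function(old, new) 100 * (old - new) / old

pcre_only <- pct_decrease(18, 3.77)  # original vs. PCRE, medians: about 79%
with_once <- pct_decrease(18, 3.15)  # original vs. update, medians: about 82.5%

# The same change expressed as a speed-up factor.
speedup <- 18 / 3.15                 # about 5.7x
```

Note that an 82% decrease in run time corresponds to code that runs roughly 5.7 times as fast.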

I'm sure further optimisations are possible. For example, the phylogenies are already loaded by read.beast and could be passed directly into read.stats_beast_internal, which would avoid the repeated text->phylo->text conversion there. I think I'll leave it there for this PR, though.