bioperl / bioperl-live-redmine

Legacy tickets migrated from the OBF Redmine issue tracker: http://redmine.open-bio.org
0 stars 0 forks source link

Reducing the time complexity of Bio/TreeIO/NewickParser.pm #148

Open cjfields opened 8 years ago

cjfields commented 8 years ago

Author Name: Thurston Dang (Thurston Dang) Original Redmine Issue: 3428, https://redmine.open-bio.org/issues/3428 Original Date: 2013-04-19 Original Assignee: Greg Jordan


The run-time of NewickParser.pm is quadratic in the length of the input, which is noticeably slow for large trees (e.g., 4000 nodes).

The quadratic complexity arises from:

  1. $token = substr($$string,0,1); (and the $index variant): this copies the entire string
  2. @ foreach my $dl (delims) { my $pos = index($$string, $dl);: this will usually look through the entire string to find the “;” delimiter, even though we only need the first delimiter (mostly not the “;”).

Two simple corresponding modifications are:

  1. convert the string into an array, so that the substr can be replaced with shifts
  2. terminate as soon as any of the delimiters are found

Taken together, NewickParser would have linear run-time.

A straight-forward implementation of the above is attached (NewickParserB.diff). A micro-optimized version is also attached (NewickParserB2.diff). It’s perhaps ~10% faster because it does not repeatedly split the delimiter string, but the code is much messier. Both pass t/Tree/TreeIO/newick.t.

cjfields commented 8 years ago

Original Redmine Comment Author Name: Jason Stajich Original Date: 2013-07-17T22:47:38Z


Greg - can you take a look at this patch. I’m surprised that the overhead for creating an array first by split is not higher than just doing substr - but it sounds like this is due to passing of string references and dereferencing it?

cjfields commented 8 years ago

Original Redmine Comment Author Name: Thurston Dang Original Date: 2013-07-17T23:00:13Z


The next_token function has:

    $token = substr($$string,0,1);
    $$string = substr($$string, 1);

(and similarly with $index)

The second line copies roughly the entire string; this is expensive because it’s repeated for every token operation. Splitting into an array first has a one-off cost, but every next_token becomes much cheaper.

Here’s another example without dereferencing, showing that repeatedly removing the first character by substr is quadratic complexity:

bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; foreach (1 .. $ARGV [0]) { $str = substr ($str, 1); }' 100000

real    0m3.463s
user    0m3.458s
sys     0m0.004s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; foreach (1 .. $ARGV [0]) { $str = substr ($str, 1); }' 200000

real    0m15.372s
user    0m15.366s
sys     0m0.005s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; foreach (1 .. $ARGV [0]) { $str = substr ($str, 1); }' 400000

real    0m59.488s
user    0m59.480s
sys     0m0.006s

while splitting into an array, then repeatedly shifting is linear:

bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; my @str = split //, $str; foreach (1 .. $ARGV [0]) { shift @str; }' 100000

real    0m0.142s
user    0m0.119s
sys     0m0.018s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; my @str = split //, $str; foreach (1 .. $ARGV [0]) { shift @str; }' 200000

real    0m0.279s
user    0m0.238s
sys     0m0.031s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; my @str = split //, $str; foreach (1 .. $ARGV [0]) { shift @str; }' 400000

real    0m0.562s
user    0m0.486s
sys     0m0.057s
cjfields commented 8 years ago

Original Redmine Comment Author Name: Greg Jordan Original Date: 2013-07-18T16:25:03Z


This looks good to me, esp. if it passes all the tests — that’s what they’re there for. :)

I always noticed the newick parsing was extremely slow for larger trees but never had the time to look deeply into it. I think I had taken most of this more or less verbatim from the Ensembl codebase without much of a critical eye, so I’m not surprised that it’s ripe for optimization.

I’d say stick with the simpler but ever slightly less efficient version rather than the more complicated one, for maintainability purposes.