Open cjfields opened 9 years ago
Original Redmine Comment Author Name: Jason Stajich Original Date: 2013-07-17T22:47:38Z
Greg - can you take a look at this patch? I’m surprised that the overhead of creating an array first with split is not higher than just doing substr - but it sounds like this is due to the passing of string references and dereferencing them?
Original Redmine Comment Author Name: Thurston Dang Original Date: 2013-07-17T23:00:13Z
The next_token function has:
$token = substr($$string,0,1);
$$string = substr($$string, 1);
(and similarly with $index)
The second line copies roughly the entire string; this is expensive because it’s repeated for every token operation. Splitting into an array first has a one-off cost, but every next_token becomes much cheaper.
Here’s another example without dereferencing, showing that repeatedly removing the first character by substr is quadratic: doubling the input roughly quadruples the run-time.
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; foreach (1 .. $ARGV [0]) { $str = substr ($str, 1); }' 100000
real 0m3.463s
user 0m3.458s
sys 0m0.004s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; foreach (1 .. $ARGV [0]) { $str = substr ($str, 1); }' 200000
real 0m15.372s
user 0m15.366s
sys 0m0.005s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; foreach (1 .. $ARGV [0]) { $str = substr ($str, 1); }' 400000
real 0m59.488s
user 0m59.480s
sys 0m0.006s
while splitting into an array, then repeatedly shifting, is linear: doubling the input roughly doubles the run-time.
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; my @str = split //, $str; foreach (1 .. $ARGV [0]) { shift @str; }' 100000
real 0m0.142s
user 0m0.119s
sys 0m0.018s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; my @str = split //, $str; foreach (1 .. $ARGV [0]) { shift @str; }' 200000
real 0m0.279s
user 0m0.238s
sys 0m0.031s
bash-3.2$ time perl -e 'my $str = "a" x $ARGV [0]; my @str = split //, $str; foreach (1 .. $ARGV [0]) { shift @str; }' 400000
real 0m0.562s
user 0m0.486s
sys 0m0.057s
Original Redmine Comment Author Name: Greg Jordan Original Date: 2013-07-18T16:25:03Z
This looks good to me, esp. if it passes all the tests — that’s what they’re there for. :)
I always noticed the newick parsing was extremely slow for larger trees but never had the time to look deeply into it. I think I had taken most of this more or less verbatim from the Ensembl codebase without much of a critical eye, so I’m not surprised that it’s ripe for optimization.
I’d say stick with the simpler but ever so slightly less efficient version rather than the more complicated one, for maintainability purposes.
Author Name: Thurston Dang (Thurston Dang) Original Redmine Issue: 3428, https://redmine.open-bio.org/issues/3428 Original Date: 2013-04-19 Original Assignee: Greg Jordan
The run-time of NewickParser.pm is quadratic in the length of the input, which is noticeably slow for large trees (e.g., 4000 nodes).
The quadratic complexity arises from two places:
1. $token = substr($$string,0,1); $$string = substr($$string, 1);
(and the $index variant): this copies roughly the entire string on every call.
2. foreach my $dl (@delims) { my $pos = index($$string, $dl); ... }
: this will usually look through the entire string to find the “;” delimiter, even though we only need the first delimiter (mostly not the “;”).
Two simple corresponding modifications are:
1. split the string into an array of characters up front, so that each substr can be replaced with a shift;
2. stop scanning at the first delimiter encountered, instead of running index() over the whole string for every delimiter.
Taken together, NewickParser would have linear run-time.
A straight-forward implementation of the above is attached (NewickParserB.diff). A micro-optimized version is also attached (NewickParserB2.diff); it’s perhaps ~10% faster because it does not repeatedly split the delimiter string, but the code is much messier. Both pass t/Tree/TreeIO/newick.t.