dschuhmacher / kanjistat

R package for analyzing Japanese kanji
https://dschuhmacher.github.io/kanjistat/
GNU General Public License v3.0

Further speed improvements #8

Closed dschuhmacher closed 5 months ago

dschuhmacher commented 5 months ago

Profiling the kanjidistmat computation of 壇 vs 増, 垣, 槽 at seg_depth 4 with approx="pcweighted" and density=30 reveals that 380 out of 430 ms are spent in component_cost. Of these, 200 ms are spent in unbalanced (there is nothing we can do about that), but 70 ms are spent in parse_svg_path (which seems a lot), hardly anything in points_from_svg (which is surprising), and still 110 ms in the rest of component_cost.
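A minimal sketch of how such a profile can be obtained with base R's Rprof()/summaryRprof(). It assumes a named list `kvecs` of kanjivec objects for the four kanji (e.g. built with kanjivec() or taken from a data package) and uses the argument names quoted above, which may not match every version of the kanjidist/kanjidistmat interface exactly.

```r
library(kanjistat)

# kvecs: hypothetical named list of kanjivec objects for 壇, 増, 垣, 槽
Rprof(interval = 0.005)
d <- kanjidistmat(kvecs["壇"], kvecs[c("増", "垣", "槽")],
                  approx = "pcweighted", density = 30, seg_depth = 4)
Rprof(NULL)

# total time per function; component_cost, unbalanced, parse_svg_path and
# points_from_svg should show up near the top
head(summaryRprof()$by.total, 12)
```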

A larger comparison of completely random kanji gives similar percentages, but attributes essentially the full runtime to component_cost and only about 10 percent to parse_svg_path (probably due to fewer strokes in the kanji).

There is not that much that can be gained. But it seems that improving somewhat on parse_svg_path should be easily(?) possible (at most 10 of the 70 ms are spent in the gsub) by using more basic string operations (package stringi??) and maybe implementing it in C++. Also, the share of time taken by the other commands in component_cost (apart from unbalanced and points_from_svg) seems rather too large.
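As an illustration of the stringi idea: a single vectorized regex pass can tokenize an SVG d-string into command letters and numbers, avoiding repeated gsub/strsplit calls. This is a generic sketch, not kanjistat's actual parse_svg_path, and the example d-string is made up.

```r
library(stringi)

tokenize_d <- function(d) {
  # one pass over the d-string: extract either a command letter or a
  # (possibly signed) decimal number; separators (spaces, commas) are dropped
  stri_extract_all_regex(d, "[A-Za-z]|-?[0-9]*\\.?[0-9]+")[[1]]
}

d <- "M 32.25,20.5 c -1.75,3.25 -4.5,6.5 -7.25,9.25"
tokenize_d(d)
# "M" "32.25" "20.5" "c" "-1.75" "3.25" "-4.5" "6.5" "-7.25" "9.25"
```

The numeric tokens can then be converted in one vectorized as.numeric() call; a C++ (e.g. Rcpp) version would mainly pay off if the parser is called many times per distance computation.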

dschuhmacher commented 5 months ago

It appears to me that the best solution is to make the parsed SVG path part of the kanjivec format rather than the original d-string. This avoids repeated parsing and removes the 70 ms completely from kanjidist.
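A minimal sketch of the idea (not the actual kanjistat implementation): parse every stroke's d-string once when the kanjivec object is created and store the result in the object, so kanjidist never has to re-parse. The field names (d_strings, parsed_paths) and the parser argument are purely illustrative.

```r
precompute_paths <- function(kvec, parser) {
  # one-time parsing cost at build time; kanjidist would then read
  # kvec$parsed_paths instead of calling the parser on every comparison
  kvec$parsed_paths <- lapply(kvec$d_strings, parser)
  kvec
}

# e.g. kvec <- precompute_paths(kvec, parser = tokenize_d)
```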

dschuhmacher commented 5 months ago

Implemented in kanjistat v0.13.0