Simmetrics / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0

Jmh #15

Closed (twillouer closed 9 years ago)

twillouer commented 9 years ago

Hi,

I'm using "Jaro" on a project, and the profiler spotted some things that can be optimized.

from:
JaroBench.compare Web Database Applications WebRAD: Building Database Applications on the Web with Visual FoxPro and Web Connection thrpt 200 814285,208 ± 10236,226 ops/s
JaroBench.compare Web Database Applications WebRAD: Building Database Applications on the Web with Visual FoxPro and Web Connection avgt 15 1207,234 ± 65,726 ns/op

to (after the optimisation):
JaroBench.compare Web Database Applications WebRAD: Building Database Applications on the Web with Visual FoxPro and Web Connection thrpt 200 1303910,930 ± 8807,522 ops/s
JaroBench.compare Web Database Applications WebRAD: Building Database Applications on the Web with Visual FoxPro and Web Connection avgt 15 763,480 ± 29,435 ns/op

If you don't want the bench structure and only want the optimization, you can take just the latest commit.
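For readers who want the gain as a single number, the figures above work out to roughly a 1.6x improvement in both modes. The following is only a small arithmetic sketch: the class and method names are hypothetical, and the scores are copied from the benchmark lines above with the decimal commas written as decimal points.

```java
// Hypothetical helper, not part of simmetrics or JMH: it only divides the
// scores quoted in this thread to express the change as a ratio.
final class SpeedupSketch {

    // Ratio of two scores; the caller decides which one goes on top.
    static double ratio(double numerator, double denominator) {
        return numerator / denominator;
    }

    // Throughput gain: ops/s after divided by ops/s before (higher is better).
    static double throughputGain() {
        return ratio(1303910.930, 814285.208);
    }

    // Average-time gain: ns/op before divided by ns/op after (lower is better).
    static double averageTimeGain() {
        return ratio(1207.234, 763.480);
    }

    public static void main(String[] args) {
        System.out.println("throughput gain:   " + throughputGain());
        System.out.println("average time gain: " + averageTimeGain());
    }
}
```

The two ratios are close but not identical, which is expected: the two modes were measured in separate runs with different sample counts and error margins.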

mpkorstanje commented 9 years ago

Very interesting. Checking it out.

twillouer commented 9 years ago

Sorry about Travis, corrected.

mpkorstanje commented 9 years ago

I've implemented your optimizations and added a few more. These have been released as version 3.0.2 and should be available on Maven Central in about 2 hours. Search results may take longer to update.

I used Caliper for benchmarking. It shows very nice results indeed! Thanks a lot.

There is probably a bit more that can be done to optimize. getCommonCharacters only destroys charsB, so charsA could be reused instead of copied. Another optimization would be to skip making a copy of common; that won't matter for the score because the results for (a,b) and (b,a) have the same length and are zero padded. But I don't have time to expand the Caliper test today.
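The two ideas above can be sketched as follows. This is not the simmetrics source: the class name, method signatures, the use of `char[]`, the NUL character as the "consumed" marker, and the choice to return 0 when the two passes find a different number of matches are all assumptions made for illustration. The match window uses the textbook Jaro definition, which may differ from what the library does.

```java
// A minimal Jaro sketch under the assumptions stated above; not library code.
final class JaroSketch {

    // Collects the characters of a that also occur in bWork within `window`
    // positions, in the order they appear in a. Only bWork is modified
    // (matched slots are overwritten with 0), so a can be shared by callers.
    // The result is returned untrimmed: unused trailing slots stay 0.
    static char[] commonCharacters(char[] a, char[] bWork, int window) {
        char[] common = new char[Math.min(a.length, bWork.length)];
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 0) continue; // NUL is reserved as the "consumed" marker
            int from = Math.max(0, i - window);
            int to = Math.min(bWork.length, i + window + 1);
            for (int j = from; j < to; j++) {
                if (bWork[j] == a[i]) {
                    common[count++] = a[i];
                    bWork[j] = 0; // consume the slot so it cannot match again
                    break;
                }
            }
        }
        return common;
    }

    // Number of leading non-zero entries in a zero-padded result.
    static int matchCount(char[] common) {
        int n = 0;
        while (n < common.length && common[n] != 0) n++;
        return n;
    }

    static float similarity(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0f;
        char[] a = s.toCharArray();
        char[] b = t.toCharArray();
        int window = Math.max(0, Math.max(a.length, b.length) / 2 - 1);

        // First pass destroys a copy of b; a is left intact and reused.
        char[] ab = commonCharacters(a, b.clone(), window);
        // Second pass may destroy a itself, since a is not read afterwards.
        char[] ba = commonCharacters(b, a, window);

        int m = matchCount(ab);
        // Assumption of this sketch: differing match counts score 0.
        if (m == 0 || m != matchCount(ba)) return 0.0f;

        // Both arrays have the same length and padding, so they can be
        // compared position by position without trimming either one.
        int mismatches = 0;
        for (int i = 0; i < m; i++) {
            if (ab[i] != ba[i]) mismatches++;
        }
        int transpositions = mismatches / 2;

        return (m / (float) s.length()
                + m / (float) t.length()
                + (m - transpositions) / (float) m) / 3.0f;
    }

    public static void main(String[] args) {
        System.out.println(similarity("MARTHA", "MARHTA"));
        System.out.println(similarity("DIXON", "DICKSONX"));
    }
}
```

In this sketch a single `clone()` remains, for the array that both passes need to read; everything else works on the original arrays or on the untrimmed results.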

mpkorstanje commented 9 years ago

FYI: 3.0.3 is going up. I added the optimizations mentioned above for Jaro. That is just about the most that can be done without mangling the code into unreadability.

Caliper.