LHNCBC / metamaplite

A near real-time named-entity recognizer
https://metamap.nlm.nih.gov/MetaMapLite.shtml
Other
55 stars 14 forks source link

Performance improvements to ERTokenImpl.equals() #28

Closed stevenbedrick closed 1 year ago

stevenbedrick commented 1 year ago

I noticed some performance impacts after my fix to equals() in ERTokenImpl, and some quality time with the profiler pointed me back to this method. The way I'd been doing the equals() check in ERTokenImpl relied on toString() and then a string comparison; toString() itself involved some expensive string concatenation, which was being doubled- once for the "source" token and once for the "other" token. Since equals() was getting used inside of a lot of inner loops, this means a lot of object allocation happening over and over, plus string comparison ain't free.

I have re-implemented equals() to directly compare the various fields; on the surface, this method might look more expensive than just doing toString() twice, since it involves three string comparisons in the worst case. However, the way the method is written, equals() begins with a simple int comparison that is likely to fail- the only time the string comparisons kick in is if the two tokens had the same offset, which is a relatively less common comparison. So we're getting a boost in the most common case. It's a little hard to say quantitatively how much this is getting us but very anecdotally in my test input I'm seeing about a 5-10% speedup for long documents with a lot of small matches.

One open question in my mind: you'll note that there's a lot of repeated downcasting going on in that check, and of course downcasting ain't free. I initially had allocated a local variable to store the result of the cast of anotherToken and used that variable in the subsequent checks, but then I had the thought that having a stack allocation inside every call to .equals() is probably not a good idea, either. I'm at the limit of my knowledge on these kind of low-level Java details, and maybe it's the sort of thing that would be optimized away by the JVM?

willjrogers commented 1 year ago

Casting is much less expensive than string concatenation or the use of StringBuilder, a faster alternative to string concatenation.