clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0

LexiconNER (and HashTrie) needs to be smaller, faster #302

Closed kwalcock closed 5 years ago

kwalcock commented 5 years ago

Mihai wrote this once upon a time:

Also, on a different topic: processors has gotten pretty fat. Some of it is out of our control (CoreNLP models), but some of it is our stuff. In particular, we use a lot the LexiconNER class, which stores its internal data (basically multi-word dictionaries) as a HashTrie:

https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/struct/HashTrie.scala

This is a class that I implemented a while ago, and I am fairly sure it’s sub-optimal. At some point I’d like to brainstorm with you about optimizing it…

kwalcock commented 5 years ago

Here is some of the discussion that followed, simply pasted from the email.

Hi Keith,

Sorry about the delay on this! Comments inline.

On Tue, Dec 11, 2018 at 1:09 PM Keith Alcock keith@keithalcock.com wrote:

So I've made lots of changes to HashTrie and LexiconNER which should not alter any results and should probably not have an impact on other files or projects (only probably, because some public defs were moved). I still have to do the LexiconNER object, verify with space and time measurements, and do regression tests. However, there are many other possible changes that may be worthwhile but will have consequences outside this environment (including serialization), and someone else should decide their fate.

  1. uniqueStrings. These are kept in HashTrie for statistics, but probably at a significant price. "Improvements" might be…

If they are not used in a significant way outside of the class (can't check right now, I don't have a laptop with IntelliJ here) then I agree they should at least be made optional. I think the simplest option would be the second one: include a debug argument in the constructor, and have it set by default to false. And uniqueStrings should be an Option, as you say.
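
A minimal sketch of that shape (the debug parameter and this add signature are illustrative, not the committed API):

```scala
import scala.collection.mutable

class HashTrie(val debug: Boolean = false) {
  // Statistics bookkeeping only happens when explicitly requested.
  val uniqueStrings: Option[mutable.HashSet[String]] =
    if (debug) Some(new mutable.HashSet[String]) else None

  def add(tokens: Array[String]): Unit = {
    uniqueStrings.foreach(_ ++= tokens) // a no-op in the default case
    // ... normal trie insertion ...
  }
}
```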

  2. The labelling in HashTrie.findNormalized and LexiconNER.findLongestMatch.

This is done with string addition, which may result in many instances of strings that are the same. I've used some lazy vals and calculate them only once per call in HashTrie or once per hit in LexiconNER, but this is still inefficient. Here is one workaround:

You mean in places like this: https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/LexiconNER.scala#L97 Yeah, you're right... I think the simplest change would be your first: compute the "B-..." and "I-..." labels just once in the c'tor.
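
That simplest change might look like this (a sketch; the surrounding class is a stand-in, not the real code):

```scala
class LexiconMatcher(label: String) {
  // Built once in the constructor instead of by string addition on every hit.
  val bLabel: String = "B-" + label
  val iLabel: String = "I-" + label
}
```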

A more invasive change would be to completely replace this string accounting with integers in a C style and resort to strings only for output. It could look something like this:

```scala
val LOC  = 0x01
val TIME = 0x02
val GEO  = 0x03

val OUTSIDE = 0x00
val BEGIN   = 0x10
val INSIDE  = 0x20

val B_LOC = BEGIN | LOC
val I_LOC = INSIDE | LOC

def printTag(tag: Int) =
  if ((tag & BEGIN) != 0) ...       // "B-" plus the entity named in the low bits
  else if ((tag & INSIDE) != 0) ... // "I-" plus the entity name
  // etc.
```

  3. Interning strings. Processor.internString(casedS) is being called optionally. It may be that it should always be called and with the interned strings tracked on a per HashTrie basis. Either way, if == strings are at some point mapped to eq() objects, the entries HashMap can be changed into a Java IdentityHashMap, which is probably more efficient, possibly much more. (It's not just the eq, but the identityHashCode that saves time.) If entries, which already tracks the strings, is used, there would also be no double bookkeeping of the interned strings (except where different KBs have the same strings).

If interning is always performed, it should be possible to replace the token of the TrieNode with an integer, either from the order it was seen as it was added to entries or the order after all the words are alphabetized. It would be very quick to look for an integer within an integer array, whether values are sorted or not.
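
A sketch of that idea, assuming entries itself serves as the intern table and insertion order supplies the id (names hypothetical):

```scala
import scala.collection.mutable

object TokenIds {
  // entries doubles as the intern table; the id is just insertion order.
  private val entries = new mutable.HashMap[String, Int]

  def idOf(token: String): Int =
    entries.getOrElseUpdate(token, entries.size)

  // Looking for an Int in an Array[Int] is quick, sorted or not.
  def hasChild(childIds: Array[Int], token: String): Boolean =
    entries.get(token).exists(id => childIds.contains(id))
}
```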

This is a good idea. At some point, all Strings in processors were interned. But this causes another problem: interned Strings are stored in a different heap memory in the JVM (I can't recall what they call it...), which is typically much smaller than the regular heap. When I was doing this at scale, we used to run a lot into weird OOM errors because of this. We can increase the size of this heap in the JVM, but I think it is always much smaller than the regular heap. So, long story short: it is probably a good idea to intern Strings that live for a long time (e.g., labels such as the examples in (2)) but probably not all Strings. Do you have a different take here?

  4. Ordering strings in the ListBuffer. Since the strings are ordered, a binary search might be better, assuming that there is a reasonable number of siblings. This may not be efficient on a ListBuffer, but that could be changed to an ArrayBuffer or an Array. It may be more efficient to convert the buffers to normal Lists or Arrays after the structure has been built so that lookup is faster.

Which ListBuffers are you referring to? In any case, I think this is a good idea!

  5. Integers instead of references/pointers. The collection of TrieNodes can eventually be flattened into a single array of integers, which would use quite a bit less memory and serialize more quickly. entries turns into an IdentityHashMap[String, Int] with the Int providing the offset into an array of integers formatted like

```
completePathFlag
childCount
child0 string id
child1 string id
child0 offset
child1 offset
```

To do this, the HashTrie would be built close to normal with add(tokens), but then converted into a compressed version optimized for the find functions.
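
Under that layout, one find step might look like this (a sketch; the exact encoding is illustrative):

```scala
// One node is encoded as:
//   completePathFlag, childCount, child string ids..., child offsets...
class CompactTrie(nodes: Array[Int], entries: Map[String, Int]) {

  // Returns the offset of the child node matching token, if any.
  def childOffset(nodeOffset: Int, token: String): Option[Int] = {
    val childCount = nodes(nodeOffset + 1)
    val idsStart = nodeOffset + 2
    val offsetsStart = idsStart + childCount

    entries.get(token).flatMap { id =>
      val index = (0 until childCount).indexWhere(i => nodes(idsStart + i) == id)
      if (index >= 0) Some(nodes(offsetsStart + index)) else None
    }
  }
}
```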

Another good idea!

That's the list right now. It may be difficult to discuss this in writing; we might want to talk about it.

Keith

On Mon, Dec 10, 2018 at 6:16 PM Mihai Surdeanu surdeanu@gmail.com wrote:

Yes,

The project you’re looking for is this one: https://github.com/clulab/bioresources

Just FYI, not relevant at this point, these resources are created from other dictionaries by this script: https://github.com/clulab/bioresources/blob/master/ner_kb.sh (but I don’t think we’re changing the format of the files at this point, so not important)

Thanks!

On Dec 10, 2018, at 5:26 PM, Keith Alcock keith@keithalcock.com wrote:

Mihai,

I've been working on the optimization of HashTrie and LexiconNER in processors and will soon write or say a small report. I'd like to test on some data, and it seems like I should be looking for files like these, which I haven't been able to track down on GitHub. Can you provide another hint? Thanks.

Keith

"org/clulab/reach/kb/ner/Gene_or_gene_product.tsv.gz",
"org/clulab/reach/kb/ner/Family.tsv.gz",
"org/clulab/reach/kb/ner/Cellular_component.tsv.gz",
"org/clulab/reach/kb/ner/Simple_chemical.tsv.gz",
"org/clulab/reach/kb/ner/Site.tsv.gz",
"org/clulab/reach/kb/ner/BioProcess.tsv.gz",
"org/clulab/reach/kb/ner/Species.tsv.gz",
"org/clulab/reach/kb/ner/CellLine.tsv.gz",
"org/clulab/reach/kb/ner/TissueType.tsv.gz",
"org/clulab/reach/kb/ner/CellType.tsv.gz",
"org/clulab/reach/kb/ner/Organ.tsv.gz"

On Wed, Nov 28, 2018 at 8:47 AM Mihai Surdeanu surdeanu@gmail.com wrote:

On Nov 27, 2018, at 4:48 PM, Keith Alcock keith@keithalcock.com wrote:

Mihai,

The rule of thumb about performance is to measure, of course, so I did so with Eidos. Apparently we only use two small files for NER: Quantifier.tsv and Property.tsv. These load in around 24ms total for the pair. Searching for named entities took 19ms for a fairly large, 200KB text file. For Eidos, it doesn't seem like this is a bottleneck. That's probably different with other projects, perhaps particularly with Reach. Do you know of an especially problematic KB file that I can target for tests?

Yes, the ruleNer in BioNLPProcessor uses massive lexicons: https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/bionlp/ner/HybridNER.scala#L23 The HashTries loaded by it are huge. Any memory optimization would help there, I think.

For runtime, this loop should be optimized: https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/bionlp/ner/HybridNER.scala#L32

There are always small improvements to make, but I don't know whether they're worth it. In the code below from HashTrie, it looks like an array of known size, the sentence length, is being constructed with list-like operations. There is a similar case in LexiconNER. (FWIW, the array doesn't have to be created at all, because it will be merged with an existing array. Memory usage is a problem, sometimes more than speed.) Things like that might halve the time for NER, but probably won't cut overall time by much. Sometimes it's just nice to have them done, though.

Sorry, what exactly are you suggesting here? Thanks!

```scala
private def findNormalized(sequence: Array[String], label: String, outsideLabel: String): Array[String] = {
  var offset = 0
  val labels = new ArrayBuffer[String]()

  while (offset < sequence.length) {
    val span = findAt(sequence, offset)
    if (span > 0) {
      labels += "B-" + label
      for (i <- 1 until span)
        labels += "I-" + label
      offset += span
    } else {
      labels += outsideLabel
      offset += 1
    }
  }
  labels.toArray
}
```
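
The suggestion is roughly this (a sketch, not a committed patch; findAt as in the original): since the output always has one label per token, the ArrayBuffer and the final toArray copy can be avoided:

```scala
private def findNormalized(sequence: Array[String], label: String, outsideLabel: String): Array[String] = {
  // One label per token, so allocate the exact size up front.
  val labels = new Array[String](sequence.length)
  var offset = 0

  while (offset < sequence.length) {
    val span = findAt(sequence, offset)
    if (span > 0) {
      labels(offset) = "B-" + label
      for (i <- 1 until span)
        labels(offset + i) = "I-" + label
      offset += span
    } else {
      labels(offset) = outsideLabel
      offset += 1
    }
  }
  labels
}
```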

Keith

On Mon, Nov 26, 2018 at 6:30 PM Mihai Surdeanu surdeanu@gmail.com wrote:

Hi Keith,

On Nov 26, 2018, at 4:52 PM, Keith Alcock keith@keithalcock.com wrote:

Mihai,

Regarding the HashTrie, as far as I can tell, neither Scala nor Java has a plain old tree in the standard library to borrow from and org.clulab.struct.Tree looks like it has a special purpose. I don't know if the optimization of the top level HashMap[String,TrieNode] is paying off.

It is certainly paying off at runtime. These lexicons store phrases that are, by and large, single words, with a long tail for multi-word phrases. So, the first level in the trie tends to be extremely wide, while levels beyond the first tend to be very narrow with very few siblings. This is why I chose a HashMap for the first layer, so the search in the first layer is efficient. However, as I am sure you know, HashMaps are expensive, because the array that stores the keys is often much larger than the number of keys actually stored. I wonder if there is a good way to compress the key array?

What worries me a little is var children:Option[ListBuffer[TrieNode]] in the TrieNode in combination with this code:

```scala
for (i <- children.indices) {
  val child = children(i)
  val compare = newChild.token.compareTo(child.token)
  if (compare < 0) {
    children.insert(i, newChild)
    ...
```

If the ListBuffer is just a List which can count and insert things, this looks like an O(n^2) operation when it could be O(n) (or less) with a(n unfortunately deprecated) LinkedList or something like that, possibly homemade. Even though these children are sorted, the search for a matching child is forced to be linear because of the list. If they were stored in an array, then a binary search could speed things up. For that matter, arrays can be nicely sorted in bulk after all insertions, but whether a quicksort would make up for having to look through the whole list is unknown. If the KBs were preprocessed to be alphabetical in the first place, then insertion wouldn't be necessary. Each of these lists could be a HashTrie itself and then the ordering wouldn't matter, but then there would be some unnecessary hashing being done and overhead if a tree is larger than a list.

I think you’re correct that beyond level 1, it is much better to store siblings in a sorted array. As I mentioned above, these levels tend to be very narrow, so creating new arrays when inserting longer phrases should be fairly cheap. And the memory savings compared to those ListBuffers should add up. I think this is worth doing!
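
A sketch of the sorted-array sibling lookup (names hypothetical; java.util.Arrays.binarySearch does the work):

```scala
import java.util.Arrays

// Siblings kept in an Array[String] sorted by natural String order,
// so finding one is O(log n) instead of a linear walk through a list.
def findChildIndex(tokens: Array[String], token: String): Option[Int] = {
  val index = Arrays.binarySearch(tokens, token, Ordering.String)
  if (index >= 0) Some(index) else None // negative means not present
}
```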

It looks like the HashTrie is used only by LexiconNER and that LexiconNER might be serialized. I don't know how long that's taking, but sometimes the native format is slow.

Those are some initial observations anyway, probably all obvious.

Keith

kwalcock commented 5 years ago

So as far as I understand it, there is a program, KBGenerator, in the bioresources project that takes a configuration file that specifies a mapping of various knowledge base (KB) contents onto labels. For example, biopax-simple_chemical onto Simple_chemical. Multiple KBs can map onto the same label, so the program combines them. The result is several text files, one for each label.

There is another program, KBLoader, that can take all these text files, put them into a NER, and serialize it, usually into model.ser.gz. It is my impression from debugging and also from cluprocessorbio.conf that this serialized version is not normally used. Rather, the text files generated by KBGenerator are read in again. This might be due to performance issues. In my test, it took 57 seconds to read in a serialized version of a LexiconNER while reading it in from text files only required 4 seconds.

Update: The 57 seconds was for an unbuffered FileInputStream. With a BufferedInputStream, time to load the old format with buffering was approximately the same as time for the new format without buffering, under 4 seconds. This isn't a significant difference. Time to load isn't the deciding factor.
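
For reference, the difference is one extra wrapper around the stream (a sketch, assuming the file really is a gzipped Java-serialized object):

```scala
import java.io.{BufferedInputStream, FileInputStream, ObjectInputStream}
import java.util.zip.GZIPInputStream

// Without BufferedInputStream, reads constantly hit the underlying file;
// with it, they are batched through an in-memory buffer.
val in = new ObjectInputStream(
  new GZIPInputStream(new BufferedInputStream(new FileInputStream("model.ser.gz")))
)
val ner = in.readObject() // the serialized LexiconNER
in.close()
```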

kwalcock commented 5 years ago

There is a reimplementation of HashTrie that addresses some of the issues itemized above, but see below for one more important one that looks like it can speed up runtime performance by 5x.

  1. I moved the tracking of unique strings to a debugging subclass which is normally not used. This saves a little time and memory.
  2. Labels are calculated only once and no longer added together constantly. They are not converted into integers, however, because that would require changes far and wide.
  3. I decided not to use an IdentityHashMap for the strings. This would require two lookups: one to map the string to an identity and another to map the identity to a TrieNode. However, now the entries are indeed a HashMap[String, Int] where the Int is an offset into an array of what used to be TrieNodes. Those no longer exist. Their information is kept in other Arrays of Ints which record the offsets of their children and the strings involved.
  4. The child TrieNodes are no longer stored in an Option[ListBuffer[TrieNode]] but in a slice of an Array[Int]. They have been sorted and are found using binarySearch. The original lists could also be sorted if necessary. The sorting isn't especially advantageous in an average case (I used org/clulab/processors/eng.testa) because there aren't many TrieNodes with a large number of children. However, they do exist, and sorting helps in pathological cases (e.g., processing of the KB itself as text). Some performance numbers will follow.
  5. As mentioned above, there is no longer a TrieNode. There are just Ints in three Arrays. This is accomplished by allowing the normal HashTrie to be built and then converting it into a CompactHashTrie. The HashTrie is quite convenient for adding new tokens, and it now satisfies an interface HashTrieAdder, while the compact version is produced after the adding is finished and is a HashTrieFinder, optimized for looking things up and low memory use (see the sketch below).
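
The adder/finder split in (5) amounts to something like this (interface names from the list above; signatures illustrative):

```scala
trait HashTrieAdder {
  def add(tokens: Array[String]): Unit
  // Freeze into the compact, read-only representation once adding is done.
  def finishAdding(): HashTrieFinder
}

trait HashTrieFinder {
  // Length of the longest entry starting at offset, or 0 if none matches.
  def findAt(sequence: Array[String], offset: Int): Int
}
```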
MihaiSurdeanu commented 5 years ago

@kwalcock: you are correct with your reading of KBGenerator and KBLoader.

MihaiSurdeanu commented 5 years ago

@kwalcock: your implementation sounds interesting. Where is it? Also, a good sanity check is to rerun all the Reach unit tests. See this script: https://github.com/clulab/reach/blob/master/runAllTests.sh Of course, for this, you have to publish processors locally, and adjust the Reach build file to use the locally-published version.

If the unit tests pass, and our review of the code is fine, I am all for pushing this change.

kwalcock commented 5 years ago

To be continued...

kwalcock commented 5 years ago

On average text, which basically has no bio NEs, the new version is approximately 10% faster. One test shows 5490ms vs. 4922ms. On pathological texts, which have more than the expected number of NEs, such as the KB itself, the new version looks to be twice as fast. One test shows 7001ms vs. 3029ms. The binary search might not be responsible, since that seemed to make a negligible difference: 3029ms vs. 2991ms. It might instead be the slower string comparisons in the old version. However, this pathological case is probably not the worst case. There are TrieNodes with up to 1854 children, and I think they will slow down the linear search substantially. There are also NEs up to 350 tokens in length, FWIW. If need be, I can time the absolute worst case scenarios.

kwalcock commented 5 years ago

It seems to be difficult to compare memory usage in Java. If the size of the serialized LexiconNER can be used as an approximation, the old version comes in at around 40MB and the new one at 18MB. The majority is for the string map. Some 45% of NEs are one token only and 49% are two tokens.

kwalcock commented 5 years ago

Up to this point, the LexiconNER does not have to be changed much. I added some interfaces and one function call (finishAdding) to convert the TrieNode from one used for adding new entries to one for finding the NEs. The next optimization does require some modification, though.

It is possible to combine all the separate HashTries into one large hash trie. Instead of each TrieNode storing a boolean for completePath, it stores an integer which specifies which matcher it came from. LexiconNER goes through its list of matchers as it gets built and passes the index that the matcher would have (if it really existed anymore). That gets put into the TrieNode. When a NE is eventually found, that index gets returned and it is used to create (or look up) the label, which is really the only part of the individual matchers that is left.
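
A sketch of the widened per-node record (names hypothetical):

```scala
object CombinedTrie {
  val NoMatch = -1 // sentinel: no matcher completes a path at this node
}

class CombinedTrie(labels: Array[String]) {
  // Instead of completePath: Boolean, each node stores the index of the
  // matcher that completes a path here, or NoMatch.
  def labelFor(matcherIndex: Int): Option[String] =
    if (matcherIndex == CombinedTrie.NoMatch) None else Some(labels(matcherIndex))
}
```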

The rough draft of this shows it being around 5.3 times faster (3029ms vs. 574ms) on the pathological data. Right now it's hacked into the existing code. The final version would require modifying more of the code (which I am reluctant to do without this report) or making a separate class to use instead.

kwalcock commented 5 years ago

Modifications, made one at a time (which makes it difficult to see the overall picture, especially with all the test code), are in the branch kwalcock-HashTrie. A squashed version which skips the intermediate steps is soon to be kwalcock-CombinedCompactHashTrie. Remember that it is not complete.

myedibleenso commented 5 years ago

Chiming in to just say that I'm excited to see efficiency improvements to LexiconNER. As an anecdote, JFR shows that, after parsing, LexiconNER operations are where we're taking the biggest hit in our project.

MihaiSurdeanu commented 5 years ago

@kwalcock: any updates on this? Thanks! (Remember that the key test here is to make sure that unit tests in Reach pass)

kwalcock commented 5 years ago

No. The changes to date have mostly preserved the current functionality and added optional, relatively minor speedups or compression. The next change, combining the trees into one, which is in rough draft here, requires some internal changes to LexiconNER that would replace, rather than preserve, current functionality. So does LexiconNER need to become OldLexiconNER and everything gets the updated LexiconNER? Or is there a NewLexiconNER that can optionally be used? There are related choices to be made and it might help to discuss them.

MihaiSurdeanu commented 5 years ago

Maybe we should:

  1. Move core functionality into a LexiconNER trait.
  2. Implement the above trait in the current LexiconNER class, and rename it TrieLexiconNER.
  3. Implement the above trait in your new class, and call it ArrayLexiconNER, or something like that.
kwalcock commented 5 years ago

1 sounds good. 2 & 3 imply keeping both versions, OK. It's more LexiconNER => SeparatedLexiconNER plus a new CombinedLexiconNER. Either one or both can have regular HashTrie storage behind it and/or CompactHashTrie. The tries have to be slightly different for the Separated and Combined versions: the separated one keeps a Boolean for completePath and the combined one keeps an Int for which Lexicon completes the path. That's a slightly troubling number of combinations, but OK.
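
In code, that would look roughly like this (a sketch; the real find signature may differ):

```scala
import org.clulab.processors.Sentence

trait LexiconNER {
  def find(sentence: Sentence): Array[String]
}

// One trie per KB; nodes keep a Boolean completePath (the current design).
class SeparatedLexiconNER extends LexiconNER {
  def find(sentence: Sentence): Array[String] = ???
}

// One merged trie; nodes keep an Int naming the KB that completes the path.
class CombinedLexiconNER extends LexiconNER {
  def find(sentence: Sentence): Array[String] = ???
}
```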

MihaiSurdeanu commented 5 years ago

Do you have a better idea in mind?

kwalcock commented 5 years ago

@MihaiSurdeanu, FWIW I worked on this again last week. It's in the testing stage now. It's contained in a branch kwalcock-DuoHashTrie now, mostly for backup purposes and for the curious. Quite a bit had to be revised for proper integration, so I changed branches.

MihaiSurdeanu commented 5 years ago

Thank you!

hickst commented 5 years ago

Just FYI, if you're still using Reach: it is very dependent on being able to read and use Bioresources through Processors and will probably have to be modified/updated for these changes, too.

kwalcock commented 5 years ago

Thanks @hickst. It has been tested quite a bit with the BioCluProcessor and is being tested with Reach right now before eventually merging.

kwalcock commented 5 years ago

One recent set of runs on pathological data gave the times below. The data was the first 5MB of the very knowledge base files that go into the LexiconNER in the first place. They are from the bioresources project in the src/main/resources/org/clulab/reach/kb directory. They are pathological because the density of NEs is much higher than it would be in normal text. This in turn probably makes these numbers look better than they would otherwise be.

Original: 13,761ms
Separated: 10,057ms (the same design as the original, just streamlined a bit)
Combined: 5,177ms (a single trie instead of multiple tries)
Compact: 2,450ms (no longer really a trie, just arrays)

That's around 5.6 times as fast as the original.

MihaiSurdeanu commented 5 years ago

Very nice!



myedibleenso commented 5 years ago

@kwalcock , can this be closed?

kwalcock commented 5 years ago

Yes. Perhaps some other part is the slowpoke now. Thanks.