Closed kwalcock closed 5 years ago
Here is some of the discussion that followed, simply pasted from the email.
Hi Keith, Sorry about the delay on this! Comments inline.
On Tue, Dec 11, 2018 at 1:09 PM Keith Alcock keith@keithalcock.com wrote: So I've made lots of changes to HashTrie and LexiconNER which should not alter any results and should probably have no impact on other files or projects (only probably, because some public defs were moved). I still have to do the LexiconNER object, verify with space and time measurements, and do regression tests. However, there are many other possible changes that may be worthwhile but will have consequences outside this environment (including serialization), and someone else should decide their fate.
If they are not used in a significant way outside of the class (can't check right now, I don't have a laptop with IntelliJ here) then I agree they should at least be made optional. I think the simplest option would be the second one: include a debug argument in constructor, and have it set by default to false. And uniqueStrings should be an Option, as you say.
This is done with string concatenation, which may result in many duplicate String instances. I've used some lazy vals and calculate only once per call in HashTrie or once per hit in LexiconNER, but this is still inefficient. Here is one workaround:
You mean in places like this: https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/sequences/LexiconNER.scala#L97 Yeah, you're right... I think the simplest change would be your first: compute the "B-..." and "I-..." labels just once in the c'tor.
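A minimal sketch of that constructor-time precomputation. The class and member names (`LabeledMatcher`, `bLabel`, `iLabel`, `markSpan`) are illustrative, not the actual LexiconNER API:

```scala
// Sketch: precompute the B-/I- labels once instead of concatenating per hit.
// Names here are illustrative, not the real LexiconNER code.
class LabeledMatcher(label: String) {
  // Built once in the constructor, reused for every match.
  val bLabel: String = "B-" + label
  val iLabel: String = "I-" + label

  // Fill labels(offset until offset + span) for one matched span.
  def markSpan(labels: Array[String], offset: Int, span: Int): Unit = {
    labels(offset) = bLabel
    for (i <- 1 until span)
      labels(offset + i) = iLabel
  }
}
```

The per-hit work then reduces to array stores of already-built strings.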
A more invasive change would be to completely replace this string accounting with integers in a C style and resort to strings only for output. It could look something like this:
val LOC = 0x01
val TIME = 0x02
val GEO = 0x03

val OUTSIDE = 0x00
val BEGIN = 0x10
val INSIDE = 0x20

val B_LOC = BEGIN | LOC
val I_LOC = INSIDE | LOC

def printTag(tag: Int) = if ((tag & BEGIN) != 0) ... else if ((tag & INSIDE) != 0) ... etc.
If interning is always performed, it should be possible to replace the token of the TrieNode with an integer, either from the order it was seen as it was added to entries or the order after all the words are alphabetized. It would be very quick to look for an integer within an integer array, whether values are sorted or not.
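The string-to-integer mapping could be as simple as the following sketch (the `StringIds` class and its method names are hypothetical, not part of processors):

```scala
import scala.collection.mutable

// Sketch: map each distinct token to a compact Int id so TrieNodes can
// store and compare Ints instead of Strings. Names are illustrative.
class StringIds {
  private val ids = new mutable.HashMap[String, Int]()

  // Return the existing id for this token, or assign the next one
  // in order of first appearance.
  def getOrAdd(token: String): Int =
    ids.getOrElseUpdate(token, ids.size)

  // Lookup without adding; -1 means the token was never seen,
  // so no trie entry can possibly match it.
  def get(token: String): Int = ids.getOrElse(token, -1)
}
```

A token absent from the map can short-circuit the search entirely, since no entry can contain it.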
This is a good idea. At some point, all Strings in processors were interned. But this causes another problem: interned Strings are stored in a different heap memory in the JVM (I can't recall what they call it...), which is typically much smaller than the regular heap. When I was doing this at scale, we used to run a lot into weird OOM errors because of this. We can increase the size of this heap in the JVM, but I think it is always much smaller than the regular heap. So, long story short: it is probably a good idea to intern Strings that live for a long time (e.g., labels such as the examples in (2)) but probably not all Strings. Do you have a different take here?
Which ListBuffers are you referring to? In any case, I think this is a good idea!
completePathFlag
childCount
child0 string id
child1 string id
child0 offset
child1 offset
To do this, the HashTrie would be built close to normal with add(tokens), but then converted into a compressed version optimized for the find functions.
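A sketch of what a frozen node in the find-optimized version might look like, assuming the layout above; `CompactNode` and its fields are illustrative names, not the actual implementation:

```scala
import java.util.Arrays

// Sketch: a node converted for finding, whose children live in two
// parallel sorted arrays, so lookup is a binary search over Ints.
class CompactNode(
  val completePath: Boolean,     // does a phrase end at this node?
  val childTokenIds: Array[Int], // sorted ascending
  val childOffsets: Array[Int]   // childOffsets(i) locates the node for childTokenIds(i)
) {
  // Returns the child node offset for tokenId, or -1 if absent.
  def findChild(tokenId: Int): Int = {
    val index = Arrays.binarySearch(childTokenIds, tokenId)
    if (index >= 0) childOffsets(index) else -1
  }
}
```

Because the arrays are built once in the conversion step, find never pays for insertion-order bookkeeping.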
Another good idea!
That's the list right now. It may be difficult to discuss this in writing; we might want to talk about it.
Keith
On Mon, Dec 10, 2018 at 6:16 PM Mihai Surdeanu surdeanu@gmail.com wrote: Yes,
The project you’re looking for is this one: https://github.com/clulab/bioresources
Just FYI, not relevant at this point, these resources are created from other dictionaries by this script: https://github.com/clulab/bioresources/blob/master/ner_kb.sh (but I don’t think we’re changing the format of the files at this point, so not important)
Thanks!
On Dec 10, 2018, at 5:26 PM, Keith Alcock keith@keithalcock.com wrote:
Mihai,
I've been working on the optimization of HashTrie and LexiconNER in processors and will soon write or say a small report. I'd like to test on some data and it seems like I should be looking for files like these which I haven't been able to track down in GitHub. Can you provide another hint? Thanks.
Keith
"org/clulab/reach/kb/ner/Gene_or_gene_product.tsv.gz", "org/clulab/reach/kb/ner/Family.tsv.gz", "org/clulab/reach/kb/ner/Cellular_component.tsv.gz", "org/clulab/reach/kb/ner/Simple_chemical.tsv.gz", "org/clulab/reach/kb/ner/Site.tsv.gz", "org/clulab/reach/kb/ner/BioProcess.tsv.gz", "org/clulab/reach/kb/ner/Species.tsv.gz", "org/clulab/reach/kb/ner/CellLine.tsv.gz", "org/clulab/reach/kb/ner/TissueType.tsv.gz", "org/clulab/reach/kb/ner/CellType.tsv.gz", "org/clulab/reach/kb/ner/Organ.tsv.gz"
On Wed, Nov 28, 2018 at 8:47 AM Mihai Surdeanu surdeanu@gmail.com wrote:
On Nov 27, 2018, at 4:48 PM, Keith Alcock keith@keithalcock.com wrote:
Mihai,
The rule of thumb about performance is to measure, of course, so I did so with Eidos. Apparently we only use two small files for NER: Quantifier.tsv and Property.tsv. These load in around 24ms total for the pair. Searching for named entities took 19ms for a fairly large, 200KB text file. For Eidos, it doesn't seem like this is a bottleneck. That's probably different with other projects, perhaps particularly with Reach. Do you know of an especially problematic KB file that I can target for tests?
yes, the ruleNer in BioNLPProcessor uses massive lexicons: https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/bionlp/ner/HybridNER.scala#L23 The HashTries loaded by it are huge. Any memory optimization would help there, I think.
For runtime, this loop should be optimized: https://github.com/clulab/processors/blob/master/corenlp/src/main/scala/org/clulab/processors/bionlp/ner/HybridNER.scala#L32
There are always small improvements to make, but I don't know whether they're worth it. In the code below from HashTrie, it looks like an array of known size, the sentence length, is being constructed with list-like operations. There is a similar case in LexiconNER. (FWIW, the array doesn't really have to be created at all, because it will be merged with an existing array. Memory usage is a problem, sometimes more than speed.) Things like that might halve the time for NER, but probably won't cut overall time by much. Sometimes it's just nice to have them done, though.
Sorry, what exactly are you suggesting here? Thanks!
private def findNormalized(sequence:Array[String], label:String, outsideLabel:String):Array[String] = {
  var offset = 0
  val labels = new ArrayBuffer[String]()
  while(offset < sequence.length) {
    val span = findAt(sequence, offset)
    if(span > 0) {
      labels += "B-" + label
      for(i <- 1 until span) {
        labels += "I-" + label
      }
      offset += span
    } else {
      labels += outsideLabel
      offset += 1
    }
  }
  labels.toArray
}
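A sketch of the array-of-known-size suggestion applied to this method. To keep the example self-contained, `findAt` is passed in as a function parameter and the precomputed labels as arguments; in HashTrie they would be the existing method and fields:

```scala
// Sketch: the output has exactly sequence.length labels, so fill a
// preallocated Array instead of growing an ArrayBuffer. findAt is a
// stand-in parameter here for HashTrie's existing method.
def findNormalized(sequence: Array[String], bLabel: String, iLabel: String,
    outsideLabel: String, findAt: (Array[String], Int) => Int): Array[String] = {
  val labels = new Array[String](sequence.length)
  var offset = 0
  while (offset < sequence.length) {
    val span = findAt(sequence, offset)
    if (span > 0) {
      labels(offset) = bLabel
      for (i <- 1 until span)
        labels(offset + i) = iLabel
      offset += span
    } else {
      labels(offset) = outsideLabel
      offset += 1
    }
  }
  labels
}
```

This avoids both the ArrayBuffer growth and the final `toArray` copy.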
Keith
On Mon, Nov 26, 2018 at 6:30 PM Mihai Surdeanu surdeanu@gmail.com wrote: Hi Keith,
On Nov 26, 2018, at 4:52 PM, Keith Alcock keith@keithalcock.com wrote:
Mihai,
Regarding the HashTrie, as far as I can tell, neither Scala nor Java has a plain old tree in the standard library to borrow from and org.clulab.struct.Tree looks like it has a special purpose. I don't know if the optimization of the top level HashMap[String,TrieNode] is paying off.
It is certainly paying off at runtime. These lexicons store phrases that are, by and large, single words, with a long tail for multi-word phrases. So, the first level in the trie tends to be extremely wide, while levels beyond the first tend to be very narrow with very few siblings. This is why I chose a HashMap for the first layer, so the search in the first layer is efficient. However, as I am sure you know, HashMaps are expensive, because the array that stores the keys is often much larger than the number of keys actually stored. I wonder if there is a good way to compress the key array?
What worries me a little is var children:Option[ListBuffer[TrieNode]] in the TrieNode in combination with this code:
for(i <- children.indices) {
  val child = children(i)
  val compare = newChild.token.compareTo(child.token)
  if(compare < 0) {
    children.insert(i, newChild)
If the ListBuffer is just a List which can count and insert things, this looks like an O(n^2) operation when it could be O(n) (or less) with a(n unfortunately deprecated) LinkedList or something like that, possibly homemade. Even though these children are sorted, the search for a matching child is forced to be linear because of the list. If they were stored in an array, then a binary search could speed things up. For that matter, arrays can be nicely sorted in bulk after all insertions, but whether a quicksort would make up for having to look through the whole list is unknown. If the KBs were preprocessed to be alphabetical in the first place, then insertion wouldn't be necessary. Each of these lists could be a HashTrie itself and then the ordering wouldn't matter, but then there would be some unnecessary hashing being done and overhead if a tree is larger than a list.
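A sketch of the sorted-array alternative for sibling storage. Both function names are illustrative; the search mirrors the `-(insertionPoint) - 1` convention of java.util.Arrays so the same call serves lookup and insertion:

```scala
// Sketch: children's tokens kept in a sorted Array; binary search gives
// O(log n) lookup instead of a linear scan through a ListBuffer.
def binarySearch(tokens: Array[String], token: String): Int = {
  var low = 0
  var high = tokens.length - 1
  while (low <= high) {
    val mid = (low + high) >>> 1
    val cmp = tokens(mid).compareTo(token)
    if (cmp < 0) low = mid + 1
    else if (cmp > 0) high = mid - 1
    else return mid
  }
  -(low + 1) // absent: encodes the insertion point
}

// Insert keeping the array sorted; copying is cheap when sibling
// lists are as short as they are beyond level 1.
def insertSorted(tokens: Array[String], token: String): Array[String] = {
  val index = binarySearch(tokens, token)
  if (index >= 0) tokens // already present
  else {
    val insertAt = -index - 1
    val result = new Array[String](tokens.length + 1)
    System.arraycopy(tokens, 0, result, 0, insertAt)
    result(insertAt) = token
    System.arraycopy(tokens, insertAt, result, insertAt + 1, tokens.length - insertAt)
    result
  }
}
```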
I think you’re correct that beyond level 1, it is much better to store siblings in a sorted array. As I mentioned above, these levels tend to be very narrow, so creating new arrays when inserting longer phrases should be fairly cheap. And the memory savings compared to those ListBuffers should add up. I think this is worth doing!
It looks like the HashTrie is used only by LexiconNER and that LexiconNER might be serialized. I don't know how long that's taking, but sometimes the native format is slow.
Those are some initial observations anyway, probably all obvious.
Keith
So as far as I understand it, there is a program, KBGenerator, in the bioresources project that takes a configuration file that specifies a mapping of various knowledge base (KB) contents onto labels. For example, biopax-simple_chemical onto Simple_chemical. Multiple KBs can map onto the same label, so the program combines them. The result is several text files, one for each label.
There is another program, KBLoader, that can take all these text files, put them into a NER, and serialize it, usually into model.ser.gz. It is my impression from debugging and also from cluprocessorbio.conf that this serialized version is not normally used. Rather, the text files generated by KBGenerator are read in again. This might be due to performance issues. In my test, it took 57 seconds to read in a serialized version of a LexiconNER while reading it in from text files only required 4 seconds.
Update: The 57 seconds was for an unbuffered FileInputStreamReader. With a BufferedInputStream, time to load the old format with buffering was approximately the same as time for the new format without buffering, under 4 seconds. This isn't a significant difference. Time to load isn't the deciding factor.
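The difference above comes down to buffering; a sketch of the buffered deserialization, with `loadSerialized` as an illustrative helper name:

```scala
import java.io.{BufferedInputStream, FileInputStream, ObjectInputStream}

// Sketch: wrap the FileInputStream in a BufferedInputStream before
// handing it to ObjectInputStream, so deserialization reads in large
// chunks rather than issuing tiny reads against the file.
def loadSerialized[T](filename: String): T = {
  val objectInputStream = new ObjectInputStream(
    new BufferedInputStream(new FileInputStream(filename)))
  try objectInputStream.readObject().asInstanceOf[T]
  finally objectInputStream.close()
}
```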
There is a reimplementation of HashTrie that addresses some of the issues itemized above, but see below for one more important one that looks like it can speed up runtime performance by 5x.
@kwalcock: you are correct with your reading of KBGenerator and KBLoader.
@kwalcock: your implementation sounds interesting. Where is it? Also, a good sanity check is to rerun all the Reach unit tests. See this script: https://github.com/clulab/reach/blob/master/runAllTests.sh Of course, for this, you have to publish processors locally, and adjust the Reach build file to use the locally-published version.
If the unit tests pass, and our review of the code is fine, I am all for pushing this change.
To be continued...
On average text, which basically has no bio NEs, the new version is approximately 10% faster. One test shows 5490ms vs. 4922ms. On pathological texts which have more than the expected number of NEs, such as the KB itself, the new version looks to be twice as fast. One test shows 7001ms vs. 3029ms. The binary search might not be responsible, since that seemed to make negligible difference: 3029ms vs. 2991ms. It might instead be the slower string comparisons in the old version. However, this pathological case is probably not the worst case. There are TrieNodes with up to 1854 children, and I think they will slow down the linear search substantially. There are also NEs up to 350 tokens in length, FWIW. If need be, I can time the absolute worst case scenarios.
It seems to be difficult to compare memory usage in Java. If the size of the serialized LexiconNER can be used as an approximation, the old version comes in at around 40MB and the new one at 18MB. The majority is for the string map. Some 45% of NEs are one token only and 49% are two tokens.
Up to this point, the LexiconNER does not have to be changed much. I added some interfaces and one function call (finishAdding) to convert the TrieNode from one used for adding new entries to one for finding the NEs. The next optimization does require some modification, though.
It is possible to combine all the separate HashTries into one large hash trie. Instead of each TrieNode storing a boolean for completePath, it stores an integer which specifies which matcher it came from. LexiconNER goes through its list of matchers as it gets built and passes the index that the matcher would have (if it really existed anymore). That gets put into the TrieNode. When an NE is eventually found, that index gets returned and it is used to create (or look up) the label, which is really the only part of the individual matchers that is left.
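A toy sketch of the combined-trie idea, with a flat map standing in for the trie itself; `CombinedLexicon` and its members are illustrative names only:

```scala
import scala.collection.mutable

// Sketch: one combined structure stores, per phrase, the index of the
// matcher it came from (rather than a per-matcher Boolean), and the
// label is recovered from that index only when a match is found.
class CombinedLexicon(labels: Array[String]) {
  private val phrases = new mutable.HashMap[String, Int]()

  // Called while building: matcherIndex says which lexicon this came from.
  def add(phrase: String, matcherIndex: Int): Unit =
    phrases(phrase) = matcherIndex

  // Called on a hit: the stored index picks out the label.
  def labelOf(phrase: String): Option[String] =
    phrases.get(phrase).map(i => labels(i))
}
```

One traversal of the combined structure replaces one traversal per matcher, which is where the ~5x speedup comes from.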
The rough draft of this shows it being around 5.3 times faster (3029ms vs. 574ms) on the pathological data. Right now it's hacked into the existing code. The final version would require modifying more of the code (which I am reluctant to do without this report) or making a separate class to use instead.
Modifications made one at a time, making it difficult to see the overall picture, especially with all the test code, are in the branch kwalcock-HashTrie. A squashed version which skips intermediate steps is soon to be kwalcock-CombinedCompactHashTrie. Remember that it is not complete.
Chiming in to just say that I'm excited to see efficiency improvements to LexiconNER. As an anecdote, JFR shows that, after parsing, LexiconNER operations are where we're taking the biggest hit in our project.
@kwalcock: any updates on this? Thanks! (Remember that the key test here is to make sure that unit tests in Reach pass)
No. The changes to date have mostly preserved the current functionality and added optional relatively minor speedups or compression. The next change, combining trees into one, which is in rough draft here, requires some internal changes to LexiconNER that would more so replace current functionality. So does LexiconNER need to become OldLexiconNER and everything gets the updated LexiconNER? Or is there a NewLexiconNER that can optionally be used? There are related choices to be made and it might help to discuss them.
Maybe we should:
1 sounds good. 2 & 3 imply keeping both versions, OK. It's more LexiconNER => SeparatedLexiconNER and a new CombinedLexiconNER. Either one or both can have regular HashTrie storage behind it and/or CompactHashTrie. The tries have to be slightly different for the Separated and Combined versions: the separated one keeps a Boolean for completePath, and the combined one keeps an Int for which lexicon completes the path. That's a slightly troubling number of combinations, but OK.
Do you have a better idea in mind?
@MihaiSurdeanu, FWIW I worked on this again last week. It's in the testing stage now. It's contained in a branch kwalcock-DuoHashTrie now, mostly for backup purposes and for the curious. Quite a bit had to be revised for proper integration, so I changed branches.
Thank you!
Just FYI, if you're still using Reach: it is very dependent on being able to read and use Bioresources through Processors and will probably have to be modified/updated for these changes, too.
Thanks @hickst. It has been tested quite a bit with the BioCluProcessor and is being tested with Reach right now before eventually merging.
One recent set of runs on pathological data gave the times below. Data was the first 5MB of the very knowledge base files that go into the LexiconNER in the first place. They are from the bioresources project in the src/main/resources/org/clulab/reach/kb directory. They are pathological because the density of NEs is much higher than it would be in normal text. This in turn probably makes these numbers look better than they would otherwise be.
Original: 13,761ms
Separated: 10,057ms (same design as the original, just streamlined a bit)
Combined: 5,177ms (a single trie instead of multiple tries)
Compact: 2,450ms (no longer really a trie, just arrays)
That's around 5.6 times as fast as the original.
Very nice!
@kwalcock , can this be closed?
Yes. Perhaps some other part is the slowpoke now. Thanks.
Mihai wrote this once upon a time:
Also, on a different topic: processors has gotten pretty fat. Some of it is out of our control (CoreNLP models), but some of it is our stuff. In particular, we use a lot the LexiconNER class, which stores its internal data (basically multi-word dictionaries) as a HashTrie:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/struct/HashTrie.scala
This is a class that I implemented a while ago, and I am fairly sure it’s sub-optimal. At some point I’d like to brainstorm with you about optimizing it…