apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Hunspell very high memory use when loading dictionary [LUCENE-5468] #6531

Closed. asfimport closed this issue 10 years ago.

asfimport commented 12 years ago

The Hunspell stemmer requires gigantic (for the task) amounts of memory to load dictionary/rules files. For example, loading a 4.5 MB Polish dictionary (with an empty index!) will cause the whole core to crash with various out-of-memory errors unless you set the max heap size to roughly 2GB or more. By comparison, Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).

Sample error log entries: http://pastebin.com/fSrdd5W1 http://pastebin.com/Lmi0re7Z


Migrated from LUCENE-5468 by Maciej Lisiewski, 1 vote, resolved Feb 27 2014. Attachments: LUCENE-5468.patch, patch.txt

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, but when testing relevance, outputting all the stems leads to slower indexing, a much larger index, and significantly worse precision for some languages.

So after reading CLEF experiments done with Hungarian (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.8036&rep=rep1&type=pdf), where they suggest a simple disambiguation heuristic (shortest stem for most aggressive), I experimented with the opposite, and found it was quite useful :)

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

Awesome, sounds like a great addition then.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

These are incredible reductions in RAM usage from cutting over to FSTs. And it's nice that you are using IntSequenceOutputs, and that you are now able to load dictionaries that failed before!

I'm not sure it matters here, but do you handle the FST Builder returning null for the built FST (when there was nothing added)? Just a common gotcha...

Do you have any sense of how the lookup speed changed?
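
For illustration, a minimal sketch of that gotcha (this is not the actual Hunspell loading code; 4.x-era FST APIs are assumed and the class/variable names are made up for the example): Builder.finish() returns null when nothing was added, so any later lookup has to tolerate a null FST.

  import java.io.IOException;
  import java.util.Map;
  import java.util.TreeMap;
  import org.apache.lucene.util.IntsRef;
  import org.apache.lucene.util.fst.Builder;
  import org.apache.lucene.util.fst.FST;
  import org.apache.lucene.util.fst.IntSequenceOutputs;
  import org.apache.lucene.util.fst.Util;

  public class EmptyFstSketch {
    // Builds an FST<IntsRef> from sorted (affix -> ids) entries, similar in spirit
    // to what the Hunspell dictionary stores; the map may legitimately be empty.
    static FST<IntsRef> buildAffixFst(TreeMap<String, IntsRef> sortedAffixes) throws IOException {
      Builder<IntsRef> builder =
          new Builder<>(FST.INPUT_TYPE.BYTE4, IntSequenceOutputs.getSingleton());
      IntsRef scratch = new IntsRef();
      // inputs must be added in sorted order (the TreeMap keeps them sorted here)
      for (Map.Entry<String, IntsRef> e : sortedAffixes.entrySet()) {
        builder.add(Util.toUTF32(e.getKey(), scratch), e.getValue());
      }
      return builder.finish(); // null if no entries were added - callers must check
    }
  }

The null guard inside the lookupAffix excerpt quoted further down is exactly the other half of this.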

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure it matters here, but do you handle the FST Builder returning null for the built FST (when there was nothing added)? Just a common gotcha...

Do you have any sense of how the lookup speed changed?

Many dictionaries have either no prefixes or no suffixes: the code comment below also answers your other question about the NULL FST, I think. Admittedly it's probably no faster now, but it can be faster if we make the Stemmer smarter when walking the possibilities in the word:

  // TODO: this is pretty stupid, considering how the stemming algorithm works
  // we can speed it up to be significantly faster!
  IntsRef lookupAffix(FST<IntsRef> fst, char word[], int offset, int length) {
    if (fst == null) {
      return null;
    }

Given that this thing sometimes takes 100MB per field and makes it nearly unusable, I made such larger changes a TODO for a separate issue.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Definitely +1 to commit this and worry about speedups separately!

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572754 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1572754

LUCENE-5468: reduce RAM usage of hunspell

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572774 from @rmuir in branch 'dev/branches/branch_4x' https://svn.apache.org/r1572774

LUCENE-5468: reduce RAM usage of hunspell

asfimport commented 10 years ago

Lukas Vlcek (migrated from JIRA)

Amazing improvement!

While we are on Hunspell, I would like to propose some additional enhancements, but first I would like to ask whether you would be interested in seeing such improvements in the code. I would be happy to open a new ticket for this in that case.

1) AFAIR the Hunspell token filter has an option to set the level of recursion, originally hardcoded to 2 if I am not mistaken. But the recursion level counts both prefix and suffix rules - meaning that if it is set to 2 and 1 prefix rule is applied, then we can only apply 2-1=1 suffix rules. What I would like to propose is adding an option to explicitly specify the recursion level separately for prefix and for suffix rules. This probably depends a lot on how the affix rules are constructed, but I can clearly see this would help in the case of the Czech dictionary - hopefully it might be found useful for other languages too.

2) Case sensitivity is a tricky part. The Czech dictionary is case sensitive and can deliver very nice results, but users cannot always fully benefit from this. The biggest problem I remember is tokens at the beginning of sentences: they start with capitals and thus may not be found in a dictionary where only the lowercased variant is recorded. One useful solution to this could be an option to lowercase a given token if it hasn't been found in the dictionary and make a second pass through the filter with the lowercased token (it is costly, but it would be optional, so the user decides whether it is worth the indexing time); a rough sketch of this idea follows after this list.

3) Also, it would be really useful if the Hunspell token filter provided an option to output all terms that result from applying the relevant rules to an input token (in essence the opposite of the transformation used during stemming). Such functionality would be useful when users want to add custom extensions to an existing dictionary (having an option to load several dict files is really useful IMO) and want to check that they constructed valid rules for specific words. Having Lucene support them directly via an exposed API would be great, I think (especially when thinking about later applications in Solr and Elasticsearch).
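
To make proposal 2) concrete, a rough, hypothetical sketch of such a fallback (this is not an existing Lucene filter; DictionaryLookup is a stand-in interface, not a real Lucene API). The idea would be to place it immediately before the Hunspell filter in the chain:

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  final class LowercaseFallbackFilter extends TokenFilter {
    /** Stand-in for a dictionary membership test; not a real Lucene API. */
    interface DictionaryLookup {
      boolean contains(char[] word, int length);
    }

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final DictionaryLookup dict;

    LowercaseFallbackFilter(TokenStream in, DictionaryLookup dict) {
      super(in);
      this.dict = dict;
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      char[] buf = termAtt.buffer();
      int len = termAtt.length();
      if (!dict.contains(buf, len)) {
        // second chance for e.g. sentence-initial capitals: try the lowercased form
        // (naive per-char lowercasing; ignores locale and supplementary characters)
        for (int i = 0; i < len; i++) {
          buf[i] = Character.toLowerCase(buf[i]);
        }
      }
      return true;
    }
  }

Whether something like this belongs inside the Hunspell filter itself is exactly what is debated below.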

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

All 3 of these options are still supported by both the filter/dictionary and the factory: look at 'recursionCap', 'ignoreCase', and the fact that the dictionaries parameter is a List<InputStream>. And by default it outputs all terms (unless you supply longestMatch=true). So I'm not really sure what is needed here?
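
For reference, a sketch of wiring some of these factory options together (assuming the 4.x-style factory constructor that takes a Map of args; 'dictionary' and 'affix' are the usual attribute names and the file names here are made up):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.hunspell.HunspellStemFilterFactory;
  import org.apache.lucene.analysis.util.ClasspathResourceLoader;

  class HunspellFactorySketch {
    static TokenStream wrap(TokenStream tokens) throws IOException {
      Map<String, String> args = new HashMap<>();
      args.put("dictionary", "cs_CZ.dic,custom.dic"); // several .dic files can be combined
      args.put("affix", "cs_CZ.aff");
      args.put("ignoreCase", "false");
      args.put("longestMatch", "false");              // default: emit all candidate stems
      HunspellStemFilterFactory factory = new HunspellStemFilterFactory(args);
      factory.inform(new ClasspathResourceLoader(HunspellStemFilterFactory.class)); // loads the files
      return factory.create(tokens);
    }
  }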

asfimport commented 10 years ago

Lukas Vlcek (migrated from JIRA)

Robert,

I did not check the latest code so please forgive my ignorance but let me try to explain:

recursionCap does not distinguish between how many prefix rules and how many suffix rules were applied, does it? It counts just the total: if I set recursionCap to 1, any combination of prefix and suffix applications that fits within that total is allowed.

This may not play well with some affix rule dictionaries. For example, the Czech dictionary is constructed in such a way that only one suffix rule can be applied, otherwise the filter can generate irrelevant tokens, so the recursionCap MUST be set to 0. However, if this recursion budget is consumed on removal of a prefix, then the filter cannot go on to also manipulate the suffix. So what I am proposing is an option to set the recursionCap separately for prefixes and suffixes. In the case of the Czech dictionary I would say: you can apply only one prefix rule and only one suffix rule (meaning you can NEVER apply two prefix rules or two suffix rules).

As for ignoreCase - how does it work if the dictionary contains terms like "Xx" and "xx" and each is allowed to use a different set of rules? I need to distinguish between them. But on the other hand, if the dictionary contains only "yy" and I get "Yy" as input (because it was the first word of the sentence), would it be able to process it correctly and still distinguish between "Xx" and "xx"?

As for the last feature, I probably confused you. What I am looking for is not the output of all possible root words for a given term, but all possible inflections for a given (root) word. For example: the input is "tell" and, based on the loaded dictionary, the output would be ["tell","tells","telling", ...]. I think it would not be hard to expose such an API, and I believe users would appreciate it when constructing custom dictionaries (I tried that and I was missing such a feature; sure, I can implement it myself, but I believe having it in Solr and Elasticsearch would be great; it is definitely not useful for the indexing process, but as part of tuning your dictionary it would be helpful).

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK, I was a little confused. I thought perhaps you were referring to the previous discussion above about removing things :)

I just want to make it clear that I kept all the additional options we already had!

So what I am proposing is an option to set the recursionCap separately for prefixes and suffixes. In the case of the Czech dictionary I would say: you can apply only one prefix rule and only one suffix rule (meaning you can NEVER apply two prefix rules or two suffix rules).

+1, can you open an issue for this?

As for ignoreCase - how does it work if the dictionary contains terms like "Xx" and "xx" and each is allowed to use a different set of rules? I need to distinguish between them.

Right, that's why it does nothing by default :)

But on the other hand, if the dictionary contains only "yy" and I get "Yy" as input (because it was the first word of the sentence), would it be able to process it correctly and still distinguish between "Xx" and "xx"?

In my opinion, this is not the responsibility of this filter (it simply has ignoreCase on or off). This has more to do with your analysis chain. So if you want to always put a lowercase filter first, that's one approach. If you want to use some rule/heuristic for sentence tokenization or other fancy stuff, you can selectively lowercase and get what you want. But this filter knows nothing about that :)
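
As an illustration of the "lowercase filter first" approach (a sketch only, assuming 4.x-era constructors that take a Version, and reusing a HunspellStemFilterFactory configured as in the earlier sketch):

  import java.io.Reader;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.hunspell.HunspellStemFilterFactory;
  import org.apache.lucene.util.Version;

  class LowercaseFirstChain {
    static TokenStream build(Reader reader, HunspellStemFilterFactory hunspell) {
      Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_47, reader);
      TokenStream chain = new LowerCaseFilter(Version.LUCENE_47, source); // normalize case up front
      return hunspell.create(chain);                                      // then apply Hunspell stemming
    }
  }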

I think it would not be hard to expose such an API, and I believe users would appreciate it when constructing custom dictionaries (I tried that and I was missing such a feature; sure, I can implement it myself, but I believe having it in Solr and Elasticsearch would be great; it is definitely not useful for the indexing process, but as part of tuning your dictionary it would be helpful).

Why not just use the hunspell command-line tools like 'unmunch', 'analyze', etc. for that?

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

I don't think we should make the recursionCap any more complex. I put it in there simply to prevent languages from getting into infinite loops.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Chris, but Lukas has a real use case and it's probably like 5 total lines of code to split that out? I dunno, it seems fine to me.

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

Yeah I guess. We can go over that in a new issue.

asfimport commented 10 years ago

Lukas Vlcek (migrated from JIRA)

OK, I will open a new ticket for the recursionCap tomorrow (it is late on my end now).

Just a couple of real quick comments on my two other suggestions:

Lowercasing in Hunspell - Robert, when you think about it, there is really no simple solution to this using the existing Lucene analysis flow AFAIK. If you apply lowercase BEFORE Hunspell, you lose the option to correctly stem the uppercased token (if there is any record for it in the dictionary). If you apply it after Hunspell, you have the problem with the first token in sentences (in most cases). The other option is (as you mentioned) to employ some more sophisticated analysis chain (but is there any suitable one in Lucene out of the box, or do I have to go down the road of setting up a complex language library or framework?). So the option to allow lowercasing for a second pass is IMO a nice compromise that can help a lot with really minimal effort (and it is also easy to explain to users what it does and when to use it). It is not a perfect solution, but it may be good enough per the 80/20 principle.

Getting all inflections - yes, there are CL tools for this. But this is really more about user experience and comfort; again, it is easy to explain how to use it and what it does, and users do not have to mess with CL tools and the like. I am not sure how hard it would be to implement this with what is in Hunspell now. Also, a CL tool run against the dictionary files is one thing, and using Lucene code on the dictionary loaded into memory by Lucene is another: if there are issues in the code, these two approaches can give different results (yes, they should be the same...).

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Lowercasing in Hunspell - Robert, when you think about it, there is really no simple solution to this using the existing Lucene analysis flow AFAIK. If you apply lowercase BEFORE Hunspell, you lose the option to correctly stem the uppercased token (if there is any record for it in the dictionary). If you apply it after Hunspell, you have the problem with the first token in sentences (in most cases). The other option is (as you mentioned) to employ some more sophisticated analysis chain (but is there any suitable one in Lucene out of the box, or do I have to go down the road of setting up a complex language library or framework?). So the option to allow lowercasing for a second pass is IMO a nice compromise that can help a lot with really minimal effort (and it is also easy to explain to users what it does and when to use it). It is not a perfect solution, but it may be good enough per the 80/20 principle.

There may not be, but it's about where the responsibility should be. It's more than the first token in sentences: named entities etc. are involved too. If you want to get this right, yes, you need a more sophisticated analysis chain! That being said, I'm not against your 80/20 heuristic, I'm just not sure how 80/20 it is :)

Getting all inflections - yes, there are CL tools for this. But this is really more about user experience and comfort; again, it is easy to explain how to use it and what it does, and users do not have to mess with CL tools and the like. I am not sure how hard it would be to implement this with what is in Hunspell now. Also, a CL tool run against the dictionary files is one thing, and using Lucene code on the dictionary loaded into memory by Lucene is another: if there are issues in the code, these two approaches can give different results (yes, they should be the same...).

On this one I honestly do disagree. I don't mean to sound rude, but if you are smart enough to make a custom dictionary, I don't think I need to baby such users around and make them comfortable by duplicating command line tools they can install themselves in java :) The tools provided by hunspell are the best here, and if someone is making a custom dictionary they already need to be digging into these tools/docs to know what they are doing. I don't see the value in duplicating this stuff and providing morphological generation and other super-advanced esoteric stuff when there are more basic things needed (like decomposition). And if the two ever differ, those are bugs that should be fixed...

asfimport commented 10 years ago

Lukas Vlcek (migrated from JIRA)

Hi Robert,

I created a new ticket #6547 for distinct recursion levels for prefix/suffix rules.

There may not be, but it's about where the responsibility should be. It's more than the first token in sentences: named entities etc. are involved too. If you want to get this right, yes, you need a more sophisticated analysis chain! That being said, I'm not against your 80/20 heuristic, I'm just not sure how 80/20 it is :)

I fully understand your point of view. The question of responsibility is important. But if I consider that a workaround like lowercasing for an optional second pass could be easier than telling the user to set up a complicated analysis chain (or employ an external system), then I believe it might make sense to make a qualified exception. :) Heh...

But seriously: how about I open a ticket for this to let the idea fly around? WDYT?

I would like to try to implement it as well (if no one else does), though I will not get to it soon. As for the 80/20 aspect, the good thing about this feature is that it could be measured (precision, recall, ...), and maybe only implementing it could tell us whether it is useful or not.

[...] if you are smart enough to make a custom dictionary, I don't think I need to baby such users around and make them comfortable by duplicating command line tools they can install themselves in java [...]

Short: I agree

Long: Creating a new dictionary is very hard. It is for wizards... But the thing here, Robert, is that creating a new dictionary from scratch is something completely different from extending an existing dictionary. Average users (like me) can probably hardly do the former, but can relatively easily do the latter. The former involves creating the affix rules; the latter means using the given affix rules and building on top of them.

When I was trying to extend an existing dictionary, I in fact had to do the following: 1) identify words that were missing in the dict file (or files), 2) assign some of the existing rules to each of them, 3) verify that #2 was done right.

As for 1), that is easy (the only trick when creating a new file with missing words is to stick to the encoding defined in the aff file).

As for 2), that is harder, but in my case I was building on top of a relatively large dictionary, so I could bet on the language's morphology already being well covered by the affix rules (i.e. I assumed I was not introducing words with new/unique morphology to the dictionary). So instead of trying to understand the rules (see my note about this below), I searched for words that should have similar morphology and reused their rules (for example, if I were to add the word "fail" I would search for "sail" and use the same rules).

As for 3), this in fact means expanding a token in root form according to all possible valid rules and checking that it all makes sense. As you pointed out, there are CL tools for this but I simply did not want to learn them (I did not feel like a wizard). And the good question is whether Lucene should provide an API that could be used for this task. At the end of the day Lucene is said to be an IR library with language analysis capabilities, so why not? But I am fine with leaving this feature out for now. Just wanted to explain some of my motivations for this feature.
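
To make step 2) concrete: an extension file in Hunspell .dic format is just a word count on the first line followed by word/flags entries. The flags below are placeholders; in practice you would copy whatever flags the analogous base-dictionary entry (e.g. "sail") carries, and the file must use the encoding declared in the .aff file:

  1
  fail/XY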

Note: As for understanding the affix rules - this is probably a complex topic and I have not had time to dig deep enough to say anything qualified about it yet. However, as far as I understand, various *spell systems have various limitations. For example, the Czech dictionary targets ispell, which allowed only a limited number of affix rules (that is what I understood from a conversation with an author of the Czech dictionary). Which means that if the number of rules is limited, then what we see shipped in the aff file is more the result of some preprocessing that takes a set of rules understandable to humans and produces a more compact set that might not be easily understood by humans.

But this is an unrelated topic, except that it illustrates the situation of an average user who just wants to add some new words to an existing dictionary and does not have the capacity to become an expert on ispell (or myspell, or aspell, or... you name it).

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I fully understand your point of view. The question of responsibility is important. But if I consider that a workaround like lowercasing for an optional second pass could be easier than telling the user to set up a complicated analysis chain (or employ an external system), then I believe it might make sense to make a qualified exception.

This responsibility is really important though. Maybe you should break away from the Czech dictionary and look at the others before you decide that it's "easiest" here. For example, the German dictionary has lots of complex casing rules encoded in the affix file itself, for decompounding purposes. This feature is already plenty complicated. If you can do ANYTHING, and I mean ANYTHING, outside of it in any way, we should keep it out of here.

As you pointed out, there are CL tools for this but I simply did not want to learn them (I did not feel like a wizard). And the good question is whether Lucene should provide an API that could be used for this task. At the end of the day Lucene is said to be an IR library with language analysis capabilities, so why not? But I am fine with leaving this feature out for now. Just wanted to explain some of my motivations for this feature.

Because it's an IR library, not a tool for building lexical resources. We just don't have the resources to "compete" with that, we don't have people that need it, and why waste our time when there are perfectly good tools available? I don't know why you refuse to "learn" the hunspell tools; they are trivial to learn!

Besides the commandline tools, quick searches reveal GUI tools too, such as http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html. Quote from the page: "My tool is so intuitive that even a 6-year-old kid can use it."

I don't think such work should be duplicated inside the Apache Lucene project.

asfimport commented 10 years ago

Lukas Vlcek (migrated from JIRA)

Good points, I did not know that about the German dictionary. From this perspective my suggestion sounds really hackish and should be left out.

[...] even a 6-year-old kid can use it.

I am always amazed about what kids these days can achieve...

Thanks for your time Robert!

asfimport commented 10 years ago

Torsten Krah (migrated from JIRA)

Just out of interest - are multiple dictionaries still supported after this change (after reading all the comments it's not clear whether that was dropped or not)? This option is nice to have, because you can make local modifications and still update the main dictionary from upstream (LibreOffice etc.) without needing to merge or anything. If not - is there already a ticket to get this working again?

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Close issue after release of 4.8.0