Hunspell very high memory use when loading dictionary [LUCENE-5468]

asfimport commented 12 years ago

The Hunspell stemmer requires gigantic (for the task) amounts of memory to load dictionary/rules files. For example, loading a 4.5 MB Polish dictionary (with an empty index!) will cause the whole core to crash with various out-of-memory errors unless you set the max heap size close to 2 GB or more. By comparison, Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).

Sample error log entries: http://pastebin.com/fSrdd5W1 http://pastebin.com/Lmi0re7Z

Migrated from LUCENE-5468 by Maciej Lisiewski, 1 vote, resolved Feb 27 2014

Attachments: LUCENE-5468.patch, patch.txt

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By comparison Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).

I imagine Stempel's Trie is good, but have you also compared Morfologik (http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/morfologik/)? Its precompiled FST might be the most space-efficient for Polish.

But really I think Hunspell's dictionary structure should be more efficient: we could build the FST on-the-fly (if case-insensitive mode is off). But when it is on, entries must be merged.

Instead it might be better for the hunspell stuff to support loading FSTs (where we would do any case-sensitivity tweaking/merging of entries, then build FST). It might be possible to re-use some of the same code from SOLR-2888 that does a similar thing to build a suggester FST.
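A rough sketch of the merging step mentioned above, not taken from any patch on this issue. It assumes plain `word/FLAGS` lines with single-character flags, and the class and method names are made up for illustration. It also sorts by Java's natural `String` order, whereas a real FST build would need to sort on the encoded bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper; not part of Lucene's hunspell package. */
final class CaseFoldingMerge {

  /**
   * Lowercases every "word/FLAGS" line, merges the flags of entries that
   * collapse to the same lowercased word, and returns the merged lines in
   * sorted key order (ready to be fed to an in-order builder).
   */
  static List<String> mergeIgnoringCase(List<String> dicLines) {
    // TreeMap keeps the keys sorted; TreeSet keeps each flag once.
    Map<String, TreeSet<Character>> merged = new TreeMap<>();
    for (String line : dicLines) {
      int slash = line.indexOf('/');
      String word = slash < 0 ? line : line.substring(0, slash);
      String flags = slash < 0 ? "" : line.substring(slash + 1);
      TreeSet<Character> set =
          merged.computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new TreeSet<>());
      for (char c : flags.toCharArray()) {
        set.add(c);
      }
    }
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, TreeSet<Character>> e : merged.entrySet()) {
      StringBuilder sb = new StringBuilder(e.getKey());
      if (!e.getValue().isEmpty()) {
        sb.append('/');
        for (char c : e.getValue()) {
          sb.append(c);
        }
      }
      out.add(sb.toString());
    }
    return out;
  }
}
```

The whole map has to be held in memory before anything can be emitted, which is why this path costs more RAM at build time than streaming an already-sorted file.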

In my opinion it's worth it to build the FST not just for the words, but also for the affixes (in some files these are humongous too!)

For Lucene I think we would just allow HunspellDictionary to also be instantiated from these FST input streams. The Solr factory / configuration would need to be tweaked to make this easy and intuitive.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Morfologik will be exactly the same size in memory as its unzipped dictionary, so about 1.8 MB + 3.5 MB if you use both the pl (morfologik) and pl-sgjp (morfeusz) dictionaries. These are fixed dictionaries (that is, unknown words won't be stemmed) but the coverage is decent for contemporary Polish.

If you explain what you're trying to do/achieve then perhaps we'll be able to give you some more hints.

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

+1 to your idea Robert. I've been thinking along the same lines that FSTs might help us out here.

asfimport commented 12 years ago

Maciej Lisiewski (migrated from JIRA)

The last time I checked, Morfologik was just mentioned as a possible new stemmer. I have used it before and I prefer it to Stempel/Hunspell, so I guess this solves my problem for now, thanks :-)

As for Hunspell, IMHO a 2 GB heap just to load a dictionary makes it borderline unusable for some languages.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

As for Hunspell, IMHO a 2 GB heap just to load a dictionary makes it borderline unusable for some languages.

Right, but honestly the original motivation was to get something up quickly when you have no other choice: for minority languages, etc.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You know what they say these days – just buy more ram and get rid of the problem by covering it with money :)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

yeah, but the HunspellDictionary really is ridiculous if you try to use a large dictionary with it; even without cutting over to an FST it could probably be improved.

for minority languages without really nice dictionaries it probably doesn't matter much, but for the languages with really nice dictionaries you also tend to have language-specific options available.

just another crazy idea: I don't know how much of morfologik is dependent upon Polish itself, but if it already knows how to compile ispell/hunspell into an efficient form and work with it, maybe we should just see if we can 'generalize' that and work it from that angle.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I must disappoint you here: morfologik simply compiles a list of inflected-base-tag triples; it has no logic for generating these forms from lexical flags/base dictionaries. Marcin Miłkowski has a set of scripts for that and he, as far as I recall, used aspell/ ispell to "dump" all of their forms by feeding the input dictionary basically. I think hunspell provides more intelligent handling of words outside of the dictionary so there's value in it that morfologik doesn't have.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Marcin Miłkowski has a set of scripts for that and he, as far as I recall, used aspell/ ispell to "dump" all of their forms by feeding the input dictionary basically. I think hunspell provides more intelligent handling of words outside of the dictionary so there's value in it that morfologik doesn't have.

I think what you describe is essentially, at a high level, exactly what the hunspellfilter does. Theoretically there is more intelligent handling possible (correcting spelling), but this isn't implemented and isn't interesting for search anyway for the most part, and there is definitely no OOV (out-of-vocabulary) mechanism.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You're probably right – my opinion was based on my inspection of hunspell's source code that I did once or twice in the past – I remember there's logic to perform more advanced stuff than dictionary lookup, but I never got the full picture if or how it's used.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm working on a quick 80/20 stab here. I think it will help a lot.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here's a patch cutting this thing over to use less RAM once it's started, but it probably uses more initially when parsing, mainly because we cannot guarantee the input is in sorted order. I think we should fix that, so that jumping through hoops is the exception rather than the rule:

we allow multiple dictionary files... is this really needed?
if you use ignoreCase it means entries can be out of sorted order too.
in some strange encodings the order in the original file could differ from binary order.

the building could just do the 2-phase thing it does now for the crazy cases and be efficient for the 90% case if we clean up.

The remaining problems:

fix existing confusion in the dictionary API (like multiple input files) so that most of the time we can rely upon sorted order.
Solr should never instantiate more than one of the same dictionary across different fields (that's a factory issue, I'm not going to deal with it here, but it's just stupid if the factory does this).
anything in the patch with nocommit, TODO, or bogus should be fixed.

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

Hey, patch looks cool Robert.

we allow multiple dictionary files... is this really needed?

I don't think so.

Solr should never instantiate more than one of the same dictionary across different fields (that's a factory issue, I'm not going to deal with it here, but it's just stupid if the factory does this)

That's a really good point actually. Makes me wonder whether there are other files / datastructures in analysis factories that are in the same boat?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Makes me wonder whether there are other files / datastructures in analysis factories that are in the same boat?

Maybe synonyms too? I dunno, it just seems like if factories implement ResourceLoaderAware, then instead of calling init() and inform() on all of them, they should be able to parse their params in init(), override equals/hashCode based on their parameters, and some mechanism would then just reuse existing ones instead of creating duplicates.
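To make the idea concrete, here is a minimal sketch of such a reuse mechanism. Everything in it is hypothetical: `SharedResourceCache` and `DictionaryParams` are invented names, not Lucene or Solr classes, and the loader stands in for whatever expensive parsing a factory would do in inform().

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: loads a resource once per distinct parameter set. */
final class SharedResourceCache<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  private final Function<K, V> loader;

  SharedResourceCache(Function<K, V> loader) {
    this.loader = loader;
  }

  /** Returns the value already loaded for an equal key, or loads it once. */
  V get(K key) {
    return cache.computeIfAbsent(key, loader);
  }
}

/** Hypothetical key: two factories with equal parameters compare equal. */
final class DictionaryParams {
  final String affixFile;
  final String dictionaryFile;
  final boolean ignoreCase;

  DictionaryParams(String affixFile, String dictionaryFile, boolean ignoreCase) {
    this.affixFile = affixFile;
    this.dictionaryFile = dictionaryFile;
    this.ignoreCase = ignoreCase;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof DictionaryParams)) return false;
    DictionaryParams other = (DictionaryParams) o;
    return ignoreCase == other.ignoreCase
        && affixFile.equals(other.affixFile)
        && dictionaryFile.equals(other.dictionaryFile);
  }

  @Override
  public int hashCode() {
    return Objects.hash(affixFile, dictionaryFile, ignoreCase);
  }
}
```

With this shape, two field types configured with the same files and options would share one loaded dictionary, while changing any parameter (for example ignoreCase) produces a separate entry.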

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Looks good to me from looking at the diff. Btw., we really should pull out the getOutputForInput(FST, input) logic currently present in lookupOrd somewhere where it's reusable – I've seen it in a few places (or needed it a few times)...

asfimport commented 12 years ago

Jan Høydahl (@janhoy) (migrated from JIRA)

Background for supporting multiple dictionaries is here: http://code.google.com/p/lucene-hunspell/issues/detail?id=4 and is invaluable for adding local customizations or overrides without touching the official dictionaries.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

at least the local override/customizations files can surely require sorted order?

asfimport commented 12 years ago

Maciej Lisiewski (migrated from JIRA)

Those overrides/customizations will be tiny when compared to the main dictionary - is the sorting really an issue here? Simple example: default PL dictionary is close to 200k words. Largest custom dictionaries (legal, military, medical) will be 5-10k words (I'm basing those estimates on the best sources that I have found to generate those dictionaries from). In most cases we should expect <1k words.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Those overrides/customizations will be tiny when compared to the main dictionary - is the sorting really an issue here?

The size doesn't matter here: our FST requires that it be built in order, and even one single out-of-order word breaks that.

Because of this, we can't build the data structure efficiently.

asfimport commented 12 years ago

Maciej Lisiewski (migrated from JIRA)

What I was trying to say is that the custom dictionaries are small enough to be loaded and sorted in memory before building FST.
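A minimal sketch of that approach, assuming the main dictionary already arrives in sorted order and only the small custom list needs sorting. `SortedMerge` is a made-up name, and the result is collected into a list here only to keep the example short; a real build would hand each word to the FST builder as it is produced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: merges a small unsorted list into a sorted stream. */
final class SortedMerge {

  static List<String> merge(Iterator<String> sortedMain, List<String> custom) {
    // Only the small custom list is copied and sorted in memory.
    List<String> small = new ArrayList<>(custom);
    Collections.sort(small);

    List<String> out = new ArrayList<>();
    int i = 0;
    String previous = null;
    while (sortedMain.hasNext()) {
      String word = sortedMain.next();
      // The main dictionary is trusted to be sorted, but verify cheaply.
      if (previous != null && previous.compareTo(word) > 0) {
        throw new IllegalArgumentException("main dictionary out of order at: " + word);
      }
      previous = word;
      // Emit every custom word that sorts at or before the current main word.
      while (i < small.size() && small.get(i).compareTo(word) <= 0) {
        out.add(small.get(i++));
      }
      out.add(word);
    }
    // Custom words that sort after the last main word.
    while (i < small.size()) {
      out.add(small.get(i++));
    }
    return out;
  }
}
```

Memory use is proportional to the custom list only, so a 5-10k word override file costs little next to a 200k word main dictionary.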

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, it's required by the hunspell format itself. So this is not crazy to enforce.

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

I don't see any problem mandating that overrides/customizations adhere to a sorted order. I don't think we can assume custom dictionaries are going to be small; there's nothing in the APIs which forces that. Using FSTs gives us the performance benefit we're seeking in this issue, and I think the small sacrifice is worth the huge benefit.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You can always sort inside your application if you're not sure whether the words come in sorted order, Maciej. Lucene/Solr now even has an on-disk merge sort which you can use for large(r) data sets; this code sits alongside FSTCompletion in trunk.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

note: in some cases we will still have to use the throwaway TreeMap or similar, like the patch I uploaded does.

but we could then know these two cases up front:

someone enables ignoreCase=true
when binary sort order of the charset != UTF-8 binary order
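Both cases are easy to demonstrate. The sketch below is illustrative only (`SortOrderCheck` and its methods are invented names): it compares two words by their unsigned encoded bytes in a given charset versus UTF-8, and checks whether lowercasing flips their relative order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical checks for the two "input may not be sorted" cases. */
final class SortOrderCheck {

  /** Compares two words by the unsigned bytes of their encoding in cs. */
  static int compareEncoded(String a, String b, Charset cs) {
    return Arrays.compareUnsigned(a.getBytes(cs), b.getBytes(cs));
  }

  /** True if the pair sorts differently in cs than it does in UTF-8. */
  static boolean orderDiffersFromUtf8(String a, String b, Charset cs) {
    return Integer.signum(compareEncoded(a, b, cs))
        != Integer.signum(compareEncoded(a, b, StandardCharsets.UTF_8));
  }

  /** True if lowercasing both words changes their relative order. */
  static boolean orderFlipsWhenLowercased(String a, String b) {
    String la = a.toLowerCase(Locale.ROOT);
    String lb = b.toLowerCase(Locale.ROOT);
    return Integer.signum(a.compareTo(b)) != Integer.signum(la.compareTo(lb));
  }

  public static void main(String[] args) {
    Charset latin2 = Charset.forName("ISO-8859-2");
    // U+0105 (a with ogonek) is byte 0xB1 in ISO-8859-2 but C4 85 in UTF-8;
    // U+00F3 (o with acute) is byte 0xF3 in ISO-8859-2 but C3 B3 in UTF-8.
    System.out.println(orderDiffersFromUtf8("\u0105", "\u00f3", latin2));
    // 'Z' (0x5A) sorts before 'a' (0x61), but "zebra" sorts after "apple".
    System.out.println(orderFlipsWhenLowercased("Zebra", "apple"));
  }
}
```

So a file that is correctly sorted in its own single-byte encoding can still be out of order once decoded and re-encoded as UTF-8, and a file sorted case-sensitively can be out of order once lowercased.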

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

You can always sort inside your application if you're not sure whether the words come in sorted order, Maciej

Well, someone has to sort to 'test' any dictionary customizations with hunspell's tools anyway.

So I assume people are already doing 'sort foo.dic my_foo_customizations.dic > combined.dic' and then using 'analyze' and other commands to test... otherwise how are they testing their customizations?!

asfimport commented 11 years ago

Maho NAKATA (migrated from JIRA)

Dictionaries with the same file location should be shared across all fields and all indexes. This would minimize the problem if you're using multiple indexes.

Currently I can't use Solr because I have 10 indexes with 5 fields each, and each field is assigned a DictionaryCompoundWordTokenFilterFactory. So the dictionary will be loaded 50 times. This is too much for my RAM.

asfimport commented 10 years ago

Maho NAKATA (migrated from JIRA)

I have now solved the problem in my special case. I wrote a custom TokenFilterFactory that wraps the DictionaryCompoundWordTokenFilterFactory / HunspellStemFilterFactory and caches the factories, so they will be reused across indexes and field types.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should let some esoteric options like multiple dictionaries keep this stuff unusable.

So I'm happy to just fork the entire stuff into a different package (hunspell2 or something), so we have a reasonably efficient version that doesn't have these esoteric options. The old stuff can stay as is, I do not care.

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

Multiple dictionaries was never in the original design either. Having an efficient and usable design seems to be of higher priority so +1 to not forking and doing this in place.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well, I don't want the whole issue to get hung up on that stuff. Basically I'm working on a number of changes (especially tests though, to ensure the stuff is really working correctly). If we want, we can just lay down my new files on top of the existing stuff, or we can keep it/deprecate it, whatever we want to do.

I just want to make some progress on a few improvements I've been investigating to try to make this thing more usable :)

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

Sounds good

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571137 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571137

LUCENE-5468: commit current state

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I brought the previous FST patch up to speed, and then built a test to parse many dictionaries and compare memory. When it says FAIL, that's because the current code can't parse the dictionary (I fixed all the issues here).

In general, RAM use is better, but in some cases it's still bad because of how the affixes are represented. I still haven't removed my TreeMap yet either (I wanted to have a way to test all the dictionaries like this before really locking things down).

dict	old RAM	new RAM
af_ZA.zip	18 MB	899 KB
ak_GH.zip	1.5 MB	71 KB
bg_BG.zip	FAIL	1.1 MB
ca_ANY.zip	28.9 MB	1.2 MB
ca_ES.zip	15.1 MB	1.2 MB
cop_EG.zip	2.1 MB	489.3 KB
cs_CZ.zip	50.4 MB	2.8 MB
cy_GB.zip	FAIL	1.6 MB
da_DK.zip	FAIL	750.8 KB
de_AT.zip	1.3 MB	293.1 KB
de_CH.zip	12.6 MB	895.6 KB
de_DE.zip	12.6 MB	895 KB
de_DE_comb.zip	102.2 MB	4.8 MB
de_DE_frami.zip	20.9 MB	1.2 MB
de_DE_neu.zip	101.5 MB	4.8 MB
el_GR.zip	74.3 MB	1.1 MB
en_AU.zip	8.1 MB	1.2 MB
en_CA.zip	9.8 MB	436.7 KB
en_GB-oed.zip	8.2 MB	1.2 MB
en_GB.zip	8.3 MB	1.2 MB
en_NZ.zip	8.4 MB	1.2 MB
eo.zip	4.9 MB	1.3 MB
eo_EO.zip	4.9 MB	1.3 MB
es_AR.zip	14.8 MB	3.9 MB
es_BO.zip	14.8 MB	3.9 MB
es_CL.zip	14.7 MB	3.9 MB
es_CO.zip	14.3 MB	3.8 MB
es_CR.zip	14.8 MB	3.9 MB
es_CU.zip	14.7 MB	3.9 MB
es_DO.zip	14.7 MB	3.9 MB
es_EC.zip	14.8 MB	3.9 MB
es_ES.zip	15.1 MB	4.1 MB
es_GT.zip	14.8 MB	3.9 MB
es_HN.zip	14.8 MB	3.9 MB
es_MX.zip	14.3 MB	3.8 MB
es_NEW.zip	15.5 MB	4.2 MB
es_NI.zip	14.8 MB	3.9 MB
es_PA.zip	14.8 MB	3.9 MB
es_PE.zip	14.2 MB	3.8 MB
es_PR.zip	14.7 MB	3.9 MB
es_PY.zip	14.8 MB	3.9 MB
es_SV.zip	14.8 MB	3.9 MB
es_UY.zip	14.8 MB	3.9 MB
es_VE.zip	14.3 MB	3.8 MB
et_EE.zip	53.6 MB	5.9 MB
fo_FO.zip	18.6 MB	485.7 KB
fr_FR-1990_1-3-2.zip	14 MB	636.4 KB
fr_FR-classique_1-3-2.zip	14 MB	743.1 KB
fr_FR_1-3-2.zip	14.5 MB	755.2 KB
fy_NL.zip	4.2 MB	272.8 KB
ga_IE.zip	14 MB	674.8 KB
gd_GB.zip	2.7 MB	111 KB
gl_ES.zip	FAIL	1.2 MB
gsc_FR.zip	FAIL	1.4 MB
gu_IN.zip	20.3 MB	914.9 KB
he_IL.zip	53.3 MB	1.8 MB
hi_IN.zip	2.7 MB	136.9 KB
hil_PH.zip	3.4 MB	164.8 KB
hr_HR.zip	29.7 MB	564.8 KB
hu_HU.zip	FAIL	17.6 MB
hu_HU_comb.zip	FAIL	19.9 MB
ia.zip	4.9 MB	211.9 KB
id_ID.zip	3.9 MB	218.4 KB
it_IT.zip	15.3 MB	1.6 MB
ku_TR.zip	1.6 MB	147.6 KB
la.zip	5.1 MB	2.5 MB
lt_LT.zip	15 MB	2.8 MB
lv_LV.zip	36.3 MB	1.9 MB
mg_MG.zip	2.9 MB	131.7 KB
mi_NZ.zip	FAIL	171.2 KB
mk_MK.zip	FAIL	436.9 KB
mos_BF.zip	13.3 MB	210 KB
mr_IN.zip	FAIL	115.5 KB
ms_MY.zip	4.1 MB	221.6 KB
nb_NO.zip	22.9 MB	1.4 MB
ne_NP.zip	5.5 MB	495.6 KB
nl_NL.zip	22.9 MB	1.1 MB
nl_med.zip	1.2 MB	60.2 KB
nn_NO.zip	16.5 MB	1 MB
nr_ZA.zip	3.1 MB	171.1 KB
ns_ZA.zip	1.7 MB	85.8 KB
ny_MW.zip	FAIL	69.6 KB
oc_FR.zip	9.1 MB	690.5 KB
pl_PL.zip	43.9 MB	4.9 MB
pt_BR.zip	FAIL	3.9 MB
pt_PT.zip	5.8 MB	773.4 KB
ro_RO.zip	5.1 MB	226.2 KB
ru_RU.zip	21.7 MB	1.4 MB
ru_RU_ye.zip	43.7 MB	1.6 MB
ru_RU_yo.zip	21.7 MB	1.4 MB
rw_RW.zip	1.6 MB	70.1 KB
sk_SK.zip	25.1 MB	2.3 MB
sl_SI.zip	38.3 MB	806.6 KB
sq_AL.zip	28.9 MB	654.6 KB
ss_ZA.zip	3.1 MB	176.3 KB
st_ZA.zip	1.7 MB	86.5 KB
sv_SE.zip	9.5 MB	668.8 KB
sw_KE.zip	6.3 MB	286 KB
tet_ID.zip	2 MB	92.4 KB
th_TH.zip	FAIL	377.4 KB
tl_PH.zip	2.6 MB	116.5 KB
tn_ZA.zip	1.5 MB	61.6 KB
ts_ZA.zip	1.6 MB	81 KB
uk_UA.zip	17.6 MB	3 MB
ve_ZA.zip	FAIL	108.8 KB
vi_VN.zip	1.7 MB	53.6 KB
xh_ZA.zip	3 MB	158.9 KB
zu_ZA.zip	24.5 MB	13.5 MB

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571321 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571321

LUCENE-5468: factor OfflineSorter out of suggest

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571356 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571356

LUCENE-5468: sort dictionary data with offline sorter

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571788 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571788

LUCENE-5468: deduplicate patterns used by affix condition check
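For readers who don't want to dig through the commit: the general idea of deduplicating condition patterns can be sketched as an interning table, where each distinct regex is compiled once and affixes refer to it by a small ordinal. This is an illustration of the technique only, with an invented class name, and is not the code from the commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical interning table: one compiled Pattern per distinct regex. */
final class PatternTable {
  private final Map<String, Integer> ordinals = new HashMap<>();
  private final List<Pattern> patterns = new ArrayList<>();

  /** Returns the ordinal for regex, compiling it only on first sight. */
  int intern(String regex) {
    Integer ord = ordinals.get(regex);
    if (ord == null) {
      ord = patterns.size();
      patterns.add(Pattern.compile(regex));
      ordinals.put(regex, ord);
    }
    return ord;
  }

  /** Looks up the shared compiled pattern for an ordinal. */
  Pattern get(int ord) {
    return patterns.get(ord);
  }

  /** Number of distinct patterns stored. */
  int size() {
    return patterns.size();
  }
}
```

Since many affix rules in a typical .aff file share the same condition, each affix then only needs to store an integer instead of its own compiled pattern.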

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571802 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571802

LUCENE-5468: remove redundant 'append' in Affix

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571807 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571807

LUCENE-5468: Stem -> CharsRef

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1571844 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571844

LUCENE-5468: make Affix fixed-width

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572643 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572643

LUCENE-5468: don't create unnecessary objects

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572660 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572660

LUCENE-5468: encode affix data as 8 bytes per affix, before cutting over to FST
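As an illustration of what a fixed 8-bytes-per-affix encoding can look like, here is a sketch that packs four 16-bit fields into one flat buffer. The commit message only states the width; the field layout below (flag, strip ordinal, condition ordinal plus cross-product bit, append ordinal) and all names are assumptions made for the example.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width affix table: 8 bytes per affix, no per-affix objects. */
final class FixedWidthAffixTable {
  static final int BYTES_PER_AFFIX = 8;
  private final ByteBuffer data;

  FixedWidthAffixTable(int affixCount) {
    data = ByteBuffer.allocate(affixCount * BYTES_PER_AFFIX);
  }

  /** Stores one affix; ordinals must fit in 16 bits (15 bits for conditionOrd). */
  void put(int index, char flag, int stripOrd, int conditionOrd,
           boolean crossProduct, int appendOrd) {
    int base = index * BYTES_PER_AFFIX;
    data.putChar(base, flag);
    data.putShort(base + 2, (short) stripOrd);
    // The low bit carries the cross-product flag, the rest the condition ordinal.
    data.putShort(base + 4, (short) (conditionOrd << 1 | (crossProduct ? 1 : 0)));
    data.putShort(base + 6, (short) appendOrd);
  }

  char flag(int index) {
    return data.getChar(index * BYTES_PER_AFFIX);
  }

  int stripOrd(int index) {
    return data.getShort(index * BYTES_PER_AFFIX + 2) & 0xFFFF;
  }

  int conditionOrd(int index) {
    return (data.getShort(index * BYTES_PER_AFFIX + 4) & 0xFFFF) >>> 1;
  }

  boolean crossProduct(int index) {
    return (data.getShort(index * BYTES_PER_AFFIX + 4) & 1) != 0;
  }

  int appendOrd(int index) {
    return data.getShort(index * BYTES_PER_AFFIX + 6) & 0xFFFF;
  }
}
```

The point of a layout like this is that an affix becomes an offset into one array rather than a separate heap object with its own header and references.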

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572666 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572666

LUCENE-5468: convert affixes to FST

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I am finished compressing for now. I think it's pretty reasonable across all the languages.

I will clean up and try to add back the multiple dictionary/ignore case stuff and clean up some other things.

dict	old RAM	new RAM
af_ZA.zip	18 MB	917.1 KB
ak_GH.zip	1.5 MB	103.2 KB
bg_BG.zip	FAIL	465.7 KB
ca_ANY.zip	28.9 MB	675.4 KB
ca_ES.zip	15.1 MB	639.8 KB
cop_EG.zip	2.1 MB	144.5 KB
cs_CZ.zip	50.4 MB	1.5 MB
cy_GB.zip	FAIL	627.4 KB
da_DK.zip	FAIL	669.8 KB
de_AT.zip	1.3 MB	123.9 KB
de_CH.zip	12.6 MB	725.4 KB
de_DE.zip	12.6 MB	726 KB
de_DE_comb.zip	102.2 MB	4.2 MB
de_DE_frami.zip	20.9 MB	1023.5 KB
de_DE_neu.zip	101.5 MB	4.2 MB
el_GR.zip	74.3 MB	1 MB
en_AU.zip	8.1 MB	521 KB
en_CA.zip	9.8 MB	450.5 KB
en_GB-oed.zip	8.2 MB	526.6 KB
en_GB.zip	8.3 MB	527.3 KB
en_NZ.zip	8.4 MB	532.4 KB
eo.zip	4.9 MB	310.5 KB
eo_EO.zip	4.9 MB	310.5 KB
es_AR.zip	14.8 MB	734.9 KB
es_BO.zip	14.8 MB	735 KB
es_CL.zip	14.7 MB	734.9 KB
es_CO.zip	14.3 MB	722.1 KB
es_CR.zip	14.8 MB	733.9 KB
es_CU.zip	14.7 MB	732.8 KB
es_DO.zip	14.7 MB	731.9 KB
es_EC.zip	14.8 MB	733.5 KB
es_ES.zip	15.1 MB	743 KB
es_GT.zip	14.8 MB	734.5 KB
es_HN.zip	14.8 MB	735.2 KB
es_MX.zip	14.3 MB	723.8 KB
es_NEW.zip	15.5 MB	768.5 KB
es_NI.zip	14.8 MB	734.5 KB
es_PA.zip	14.8 MB	733.8 KB
es_PE.zip	14.2 MB	721.3 KB
es_PR.zip	14.7 MB	732.4 KB
es_PY.zip	14.8 MB	734.1 KB
es_SV.zip	14.8 MB	733.6 KB
es_UY.zip	14.8 MB	736.9 KB
es_VE.zip	14.3 MB	722.7 KB
et_EE.zip	53.6 MB	473.6 KB
fo_FO.zip	18.6 MB	517.9 KB
fr_FR-1990_1-3-2.zip	14 MB	526.7 KB
fr_FR-classique_1-3-2.zip	14 MB	539.2 KB
fr_FR_1-3-2.zip	14.5 MB	550.4 KB
fy_NL.zip	4.2 MB	265.6 KB
ga_IE.zip	14 MB	460.6 KB
gd_GB.zip	2.7 MB	143.1 KB
gl_ES.zip	FAIL	479.4 KB
gsc_FR.zip	FAIL	1.3 MB
gu_IN.zip	20.3 MB	947 KB
he_IL.zip	53.3 MB	539.2 KB
hi_IN.zip	2.7 MB	169 KB
hil_PH.zip	3.4 MB	197 KB
hr_HR.zip	29.7 MB	573 KB
hu_HU.zip	FAIL	1.2 MB
hu_HU_comb.zip	FAIL	5.4 MB
ia.zip	4.9 MB	222.9 KB
id_ID.zip	3.9 MB	226.3 KB
it_IT.zip	15.3 MB	612.9 KB
ku_TR.zip	1.6 MB	118.7 KB
la.zip	5.1 MB	199.3 KB
lt_LT.zip	15 MB	682.5 KB
lv_LV.zip	36.3 MB	763.9 KB
mg_MG.zip	2.9 MB	163.8 KB
mi_NZ.zip	FAIL	191.4 KB
mk_MK.zip	FAIL	469.1 KB
mos_BF.zip	13.3 MB	242.2 KB
mr_IN.zip	FAIL	147.7 KB
ms_MY.zip	4.1 MB	226.9 KB
nb_NO.zip	22.9 MB	1.2 MB
ne_NP.zip	5.5 MB	328.1 KB
nl_NL.zip	22.9 MB	1.1 MB
nl_med.zip	1.2 MB	92.3 KB
nn_NO.zip	16.5 MB	914 KB
nr_ZA.zip	3.1 MB	203.3 KB
ns_ZA.zip	1.7 MB	118 KB
ny_MW.zip	FAIL	101.8 KB
oc_FR.zip	9.1 MB	401.5 KB
pl_PL.zip	43.9 MB	1.7 MB
pt_BR.zip	FAIL	2.1 MB
pt_PT.zip	5.8 MB	379.4 KB
ro_RO.zip	5.1 MB	256.3 KB
ru_RU.zip	21.7 MB	882 KB
ru_RU_ye.zip	43.7 MB	1.5 MB
ru_RU_yo.zip	21.7 MB	897.3 KB
rw_RW.zip	1.6 MB	102.3 KB
sk_SK.zip	25.1 MB	1.2 MB
sl_SI.zip	38.3 MB	604 KB
sq_AL.zip	28.9 MB	581.7 KB
ss_ZA.zip	3.1 MB	208.5 KB
st_ZA.zip	1.7 MB	118.7 KB
sv_SE.zip	9.5 MB	535.4 KB
sw_KE.zip	6.3 MB	318.2 KB
tet_ID.zip	2 MB	124.5 KB
th_TH.zip	FAIL	409.6 KB
tl_PH.zip	2.6 MB	148.7 KB
tn_ZA.zip	1.5 MB	93.7 KB
ts_ZA.zip	1.6 MB	113.1 KB
uk_UA.zip	17.6 MB	979.1 KB
ve_ZA.zip	FAIL	140.9 KB
vi_VN.zip	1.7 MB	85.8 KB
xh_ZA.zip	3 MB	191.1 KB
zu_ZA.zip	24.5 MB	827.1 KB
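To read the table as reduction factors, a small sketch that parses the size strings and divides old by new. It assumes 1 MB = 1024 KB, which the table does not state, and the class and method names are made up for illustration; the sample rows are copied from the table above.

```java
import java.util.Locale;

/** Hypothetical helper for turning the table's size columns into ratios. */
final class RamReduction {

  /** Parses sizes such as "43.9 MB" or "473.6 KB" into kilobytes. */
  static double toKb(String size) {
    String[] parts = size.trim().split("\\s+");
    double value = Double.parseDouble(parts[0]);
    switch (parts[1]) {
      case "KB":
        return value;
      case "MB":
        return value * 1024; // assumption: binary megabytes
      default:
        throw new IllegalArgumentException("unknown unit: " + parts[1]);
    }
  }

  /** How many times smaller newRam is than oldRam. */
  static double factor(String oldRam, String newRam) {
    return toKb(oldRam) / toKb(newRam);
  }

  public static void main(String[] args) {
    String[][] rows = {
      {"pl_PL.zip", "43.9 MB", "1.7 MB"},
      {"et_EE.zip", "53.6 MB", "473.6 KB"},
      {"de_AT.zip", "1.3 MB", "123.9 KB"},
    };
    for (String[] row : rows) {
      System.out.println(
          String.format(Locale.ROOT, "%s: %.1fx smaller", row[0], factor(row[1], row[2])));
    }
  }
}
```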

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

Those are some pretty amazing reductions, well done!

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I have the previous options added back locally too, so I will fix up tests and so on and just copy over the old filter and make a patch.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572718 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572718

LUCENE-5468: hunspell2 -> hunspell (with previous options and tests)

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572724 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572724

LUCENE-5468: fix precommit+test

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1572727 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572727

LUCENE-5468: add additional change

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the change is ready. There are other improvements that can be done (for example, maybe an option for the factory to cache these things in case you use the same ones across multiple fields, more efficient affix handling against the FST, and so on), but it would be better on different issues I think?

Here is a patch (from diff-sources), sorry it's not so useful, as I renamed some things. I tried making one from svn diff after reintegration, but it was equally useless. If you want you can also review my commits on this issue to the branch, too.

Here is the CHANGES entry:

API Changes:

LUCENE-5468: Move offline Sort (from suggest module) to OfflineSort. (Robert Muir)

Optimizations:

LUCENE-5468: HunspellStemFilter uses 10 to 100x less RAM. It also loads all known openoffice dictionaries without error, and supports an additional longestOnly option for a less aggressive approach. (Robert Muir)

asfimport commented 10 years ago

Chris Male (migrated from JIRA)

Is the longestOnly option a standard Hunspell thing? (more a question of general interest)
