Robert Muir (@rmuir) (migrated from JIRA)
By comparison Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).
I imagine Stempel's Trie is good, but have you also compared Morfologik (http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/morfologik/)? Its precompiled FST might be the most space-efficient option for Polish.
But really I think Hunspell's dictionary structure should be more efficient: we could build the FST on the fly (if case-insensitive mode is off). When it is on, though, entries must be merged.
Instead it might be better for the hunspell stuff to support loading FSTs (where we would do any case-sensitivity tweaking/merging of entries, then build FST). It might be possible to re-use some of the same code from SOLR-2888 that does a similar thing to build a suggester FST.
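A minimal sketch of the merge step described above, using only the JDK. `CaseFoldingMerger` and the word/flags pair layout are hypothetical and are not part of HunspellDictionary; the point is only that entries colliding after case folding must be combined before a sorted structure can be built. Note the assumption that Java's natural `String` order is acceptable: a real FST builder would need the order of its own key encoding, which can differ for supplementary characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not Lucene code. */
final class CaseFoldingMerger {
    private CaseFoldingMerger() {}

    /**
     * Each entry is a {word, flags} pair. Words are folded to lower case and
     * the flag strings of entries that collide after folding are collected
     * under one key. The TreeMap keeps keys in sorted order, so iterating it
     * yields merged entries in the order a sorted-input builder expects.
     */
    static TreeMap<String, List<String>> merge(List<String[]> entries) {
        TreeMap<String, List<String>> merged = new TreeMap<>();
        for (String[] entry : entries) {
            String key = entry[0].toLowerCase(Locale.ROOT);
            String flags = entry[1];
            List<String> flagList = merged.computeIfAbsent(key, k -> new ArrayList<>());
            // Avoid storing the same flag string twice for one folded word.
            if (!flagList.contains(flags)) {
                flagList.add(flags);
            }
        }
        return merged;
    }
}
```

When case-insensitive mode is off, no entries collide, so this step could be skipped and already-sorted input fed to the builder directly.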
In my opinion it's worth building the FST not just for the words, but also for the affixes (in some files these are humongous too!)
For lucene I think we would just allow HunspellDictionary to also be instantiated from these FST inputstreams. The solr factory / configuration would need to be tweaked to make this easy and intuitive.
Dawid Weiss (@dweiss) (migrated from JIRA)
Morfologik will be exactly the same size in memory as its unzipped dictionary, so about 1.8MB + 3.5MB if you use both pl (morfologik) and pl-sgjp (morfeusz) dictionaries. These are fixed dictionaries (that is, unknown words won't be stemmed) but the coverage is decent for contemporary Polish.
If you explain what you're trying to do/ achieve then perhaps we'll be able to give you some more hints.
Chris Male (migrated from JIRA)
+1 to your idea Robert. I've been thinking along the same lines that FSTs might help us out here.
Maciej Lisiewski (migrated from JIRA)
The last time I checked Morfologik was just mentioned as a possible new stemmer - I have used it before and I prefer it to Stempel/Hunspell, so I guess this solves my problem for now, thanks :-)
As for Hunspell IMHO 2GB heap just to load dictionary makes it borderline unusable for some languages.
Robert Muir (@rmuir) (migrated from JIRA)
As for Hunspell IMHO 2GB heap just to load dictionary makes it borderline unusable for some languages.
Right but honestly the original motivation was to get something up quickly when you have no other choice: for minority languages, etc.
Dawid Weiss (@dweiss) (migrated from JIRA)
You know what they say these days – just buy more ram and get rid of the problem by covering it with money :)
Robert Muir (@rmuir) (migrated from JIRA)
Yeah, but the HunspellDictionary really is ridiculous if you try to use a large dictionary with it. Even without cutting over to an FST it could probably be improved.
For minority languages without really nice dictionaries it probably doesn't matter much, but for the languages with really nice dictionaries you also tend to have language-specific options available.
Just another crazy idea: I don't know how much of morfologik is dependent upon Polish itself, but if it already knows how to compile ispell/hunspell into an efficient form and work with it, maybe we should just see if we can 'generalize' that and work it from that angle.
Dawid Weiss (@dweiss) (migrated from JIRA)
I must disappoint you here – morfologik simply compiles a list of inflected-base-tag triples, it has no logic for generating these forms from lexical flags/ base dictionaries. Marcin Miłkowski has a set of scripts for that and he, as far as I recall, used aspell/ ispell to "dump" all of their forms by feeding the input dictionary basically. I think hunspell provides more intelligent handling of words outside of the dictionary so there's value in it that morfologik doesn't have.
Robert Muir (@rmuir) (migrated from JIRA)
Marcin Miłkowski has a set of scripts for that and he, as far as I recall, used aspell/ ispell to "dump" all of their forms by feeding the input dictionary basically. I think hunspell provides more intelligent handling of words outside of the dictionary so there's value in it that morfologik doesn't have.
I think what you describe is, at a high level, essentially what the hunspellfilter does. Theoretically there is more intelligent handling possible (correcting spelling), but this isn't implemented and isn't interesting for search anyway for the most part, and there is definitely no OOV mechanism.
Dawid Weiss (@dweiss) (migrated from JIRA)
You're probably right – my opinion was based on my inspection of hunspell's source code that I did once or twice in the past – I remember there's logic to perform more advanced stuff than dictionary lookup, but I never got the full picture if or how it's used.
Robert Muir (@rmuir) (migrated from JIRA)
I'm working on a quick 80/20 stab here. I think it will help a lot.
Robert Muir (@rmuir) (migrated from JIRA)
Here's a patch cutting this thing over to use less RAM once it's started, but it probably uses more initially when parsing, mainly because we cannot guarantee the input is in sorted order. I think we should fix that, so that jumping through hoops is the exception rather than the rule:
the building could just do the 2-phase thing it does now for the crazy cases and be efficient for the 90% case if we clean up.
The remaining problems:
Chris Male (migrated from JIRA)
Hey, patch looks cool Robert.
we allow multiple dictionary files... is this really needed?
I don't think so.
solr should never instantiate more than one of the same dictionary across different fields (thats a factory issue, i'm not going to deal with it here, but its just stupid if the factory does this)
That's a really good point actually. Makes me wonder whether there are other files / data structures in analysis factories that are in the same boat?
Robert Muir (@rmuir) (migrated from JIRA)
Makes me wonder whether there are other files / datastructures in analysis factories that are in the same boat?
Maybe synonyms too? I dunno, it just seems like if factories implement ResourceLoaderAware, then instead of calling init() and inform() on all of them, they should be able to parse their params in init(), override equals/hashCode based on those parameters, and some mechanism would then reuse existing ones instead of creating duplicates.
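One way such a reuse mechanism could look, as a rough sketch under stated assumptions: `SharedResourceCache` is a hypothetical class, not a Solr or Lucene API, and the factory parameters are assumed to be a plain `Map<String, String>`. It relies on the fact that `Map.equals` and `Map.hashCode` compare contents, so two fields configured with the same parameters resolve to the same loaded object.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical registry for illustration; not an existing Solr mechanism. */
final class SharedResourceCache<V> {
    private final ConcurrentHashMap<Map<String, String>, V> cache = new ConcurrentHashMap<>();

    /**
     * Returns the resource already loaded for an equal parameter map, or
     * runs the loader once and remembers the result.
     */
    V getOrLoad(Map<String, String> params, Function<Map<String, String>, V> loader) {
        // Defensive copy: the key must not change if the caller later mutates params.
        Map<String, String> key = Collections.unmodifiableMap(new TreeMap<>(params));
        return cache.computeIfAbsent(key, loader);
    }

    /** Number of distinct parameter sets loaded so far. */
    int size() {
        return cache.size();
    }
}
```

The loader runs inside `computeIfAbsent`, so concurrent requests for the same key wait rather than loading twice. For something as slow as parsing a large dictionary, a real implementation might prefer caching a future instead, so that unrelated keys are never blocked.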
Dawid Weiss (@dweiss) (migrated from JIRA)
Looks good to me from looking at the diff. Btw., we really should pull out the getOutputForInput(FST, input) logic currently present in lookupOrd somewhere where it's reusable – I've seen it in a few places (or needed it a few times)...
Jan Høydahl (@janhoy) (migrated from JIRA)
Background for supporting multiple dictionaries is here: http://code.google.com/p/lucene-hunspell/issues/detail?id=4 and is invaluable for adding local customizations or overrides without touching the official dictionaries.
Robert Muir (@rmuir) (migrated from JIRA)
at least the local override/customizations files can surely require sorted order?
Maciej Lisiewski (migrated from JIRA)
Those overrides/customizations will be tiny when compared to the main dictionary - is the sorting really an issue here? Simple example: default PL dictionary is close to 200k words. Largest custom dictionaries (legal, military, medical) will be 5-10k words (I'm basing those estimates on the best sources that I have found to generate those dictionaries from). In most cases we should expect <1k words.
Robert Muir (@rmuir) (migrated from JIRA)
Those overrides/customizations will be tiny when compared to the main dictionary - is the sorting really an issue here?
Size doesn't matter here: our FST requires that it be built in order, and it breaks if even one single word is out of order.
because of this, we can't build the data structure efficiently.
Maciej Lisiewski (migrated from JIRA)
What I was trying to say is that the custom dictionaries are small enough to be loaded and sorted in memory before building FST.
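A sketch of that idea, assuming the main dictionary arrives as an already-sorted stream of words and the custom list fits in memory. `SortedDictionaryMerge` is a hypothetical name, affix flags are ignored for brevity, and the result is collected into a list only to keep the example short; a real builder would consume the words one by one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper for illustration; not Lucene code. */
final class SortedDictionaryMerge {
    private SortedDictionaryMerge() {}

    /**
     * Sorts the small custom list in memory, then merges it with the
     * already-sorted main dictionary in a single pass, so the combined
     * output is in sorted order. Fails fast if the main input is unsorted.
     */
    static List<String> merge(Iterator<String> sortedMain, List<String> custom) {
        List<String> sortedCustom = new ArrayList<>(custom);
        Collections.sort(sortedCustom);

        List<String> out = new ArrayList<>();
        int i = 0;
        String last = null;
        while (sortedMain.hasNext()) {
            String word = sortedMain.next();
            if (last != null && word.compareTo(last) < 0) {
                throw new IllegalArgumentException("main dictionary is not sorted at: " + word);
            }
            // Emit every custom word that sorts before (or equal to) the current main word.
            while (i < sortedCustom.size() && sortedCustom.get(i).compareTo(word) <= 0) {
                out.add(sortedCustom.get(i++));
            }
            out.add(word);
            last = word;
        }
        // Custom words that sort after the whole main dictionary.
        while (i < sortedCustom.size()) {
            out.add(sortedCustom.get(i++));
        }
        return out;
    }
}
```

Only the custom list is ever held and sorted in memory, so the cost scales with the size of the customizations, not with the main dictionary.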
Robert Muir (@rmuir) (migrated from JIRA)
Also, it's required by the hunspell format itself. So this is not crazy to enforce.
Chris Male (migrated from JIRA)
I don't see any problem mandating that overrides/customizations adhere to a sorted order. I don't think we can assume custom dictionaries are going to be small - there's nothing in the APIs which force that. Using FSTs gives us the performance benefit we're seeking in this issue, I think the small sacrifice is worth the huge benefit.
Dawid Weiss (@dweiss) (migrated from JIRA)
You can always sort inside your application if you're not sure if the words come or not in sorted order, Maciej. Lucene/Solr now even has on-disk merge sort which you can use for large(r) data sets – this code is along FSTCompletion in trunk.
Robert Muir (@rmuir) (migrated from JIRA)
note: in some cases we will still have to use the throwaway treemap or similar like the patch i uploaded does.
but we could then know these two cases up front:
Robert Muir (@rmuir) (migrated from JIRA)
You can always sort inside your application if you're not sure if the words come or not in sorted order, Maciej
Well someone has to sort to 'test' any dictionary customizations with hunspells tools anyway.
So I assume people are already doing 'sort foo.dic my_foo_customizations.dic > combined.dic' then using 'analyze' and other commands to test... otherwise how are they testing their customizations?!
Maho NAKATA (migrated from JIRA)
Dictionaries with the same file location should be shared across all fields and all indexes. This would minimize the problem if you're using multiple indexes.
Currently I can't use Solr because I have 10 indexes with 5 fields each, and each field is assigned a DictionaryCompoundWordTokenFilterFactory. So the dictionary will be loaded 50 times. This is too much for my RAM.
Maho NAKATA (migrated from JIRA)
I now solved the problem in my special case. I wrote a custom TokenFilterFactory that wraps the DictionaryCompoundWordTokenFilterFactory / HunspellStemFilterFactory and caches the factories, so they will be reused across indexes and fieldtypes.
Robert Muir (@rmuir) (migrated from JIRA)
I don't think we should let some esoteric options like multiple dictionaries keep this stuff unusable.
So I'm happy to just fork the entire stuff into a different package (hunspell2 or something), so we have a reasonably efficient version that doesn't have these esoteric options. The old stuff can stay as is, I do not care.
Chris Male (migrated from JIRA)
Multiple dictionaries was never in the original design either. Having an efficient and usable design seems to be of higher priority so +1 to not forking and doing this in place.
Robert Muir (@rmuir) (migrated from JIRA)
Well, I don't want the whole issue to get hung up on that stuff. Basically i'm working on a number of changes (especially tests though, to ensure the stuff is really working correctly). If we want, we can just lay down my new files on top of the existing stuff, or we can keep it/deprecate it, whatever we want to do.
I just want to make some progress on a few improvements I've been investigating to try to make this thing more usable :)
Chris Male (migrated from JIRA)
Sounds good
ASF subversion and git services (migrated from JIRA)
Commit 1571137 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571137
LUCENE-5468: commit current state
Robert Muir (@rmuir) (migrated from JIRA)
I brought the previous FST patch up to speed, and then built a test to parse many dictionaries and compare memory. When it says FAIL, that's because the current code can't parse the dictionary (I fixed all the issues here).
In general, RAM use is better, but in some cases it's still bad because of how the affixes are represented. I still haven't removed my TreeMap yet either (I wanted to have a way to test all the dictionaries like this before really locking things down).
dict | old RAM | new RAM |
---|---|---|
af_ZA.zip | 18 MB | 899 KB |
ak_GH.zip | 1.5 MB | 71 KB |
bg_BG.zip | FAIL | 1.1 MB |
ca_ANY.zip | 28.9 MB | 1.2 MB |
ca_ES.zip | 15.1 MB | 1.2 MB |
cop_EG.zip | 2.1 MB | 489.3 KB |
cs_CZ.zip | 50.4 MB | 2.8 MB |
cy_GB.zip | FAIL | 1.6 MB |
da_DK.zip | FAIL | 750.8 KB |
de_AT.zip | 1.3 MB | 293.1 KB |
de_CH.zip | 12.6 MB | 895.6 KB |
de_DE.zip | 12.6 MB | 895 KB |
de_DE_comb.zip | 102.2 MB | 4.8 MB |
de_DE_frami.zip | 20.9 MB | 1.2 MB |
de_DE_neu.zip | 101.5 MB | 4.8 MB |
el_GR.zip | 74.3 MB | 1.1 MB |
en_AU.zip | 8.1 MB | 1.2 MB |
en_CA.zip | 9.8 MB | 436.7 KB |
en_GB-oed.zip | 8.2 MB | 1.2 MB |
en_GB.zip | 8.3 MB | 1.2 MB |
en_NZ.zip | 8.4 MB | 1.2 MB |
eo.zip | 4.9 MB | 1.3 MB |
eo_EO.zip | 4.9 MB | 1.3 MB |
es_AR.zip | 14.8 MB | 3.9 MB |
es_BO.zip | 14.8 MB | 3.9 MB |
es_CL.zip | 14.7 MB | 3.9 MB |
es_CO.zip | 14.3 MB | 3.8 MB |
es_CR.zip | 14.8 MB | 3.9 MB |
es_CU.zip | 14.7 MB | 3.9 MB |
es_DO.zip | 14.7 MB | 3.9 MB |
es_EC.zip | 14.8 MB | 3.9 MB |
es_ES.zip | 15.1 MB | 4.1 MB |
es_GT.zip | 14.8 MB | 3.9 MB |
es_HN.zip | 14.8 MB | 3.9 MB |
es_MX.zip | 14.3 MB | 3.8 MB |
es_NEW.zip | 15.5 MB | 4.2 MB |
es_NI.zip | 14.8 MB | 3.9 MB |
es_PA.zip | 14.8 MB | 3.9 MB |
es_PE.zip | 14.2 MB | 3.8 MB |
es_PR.zip | 14.7 MB | 3.9 MB |
es_PY.zip | 14.8 MB | 3.9 MB |
es_SV.zip | 14.8 MB | 3.9 MB |
es_UY.zip | 14.8 MB | 3.9 MB |
es_VE.zip | 14.3 MB | 3.8 MB |
et_EE.zip | 53.6 MB | 5.9 MB |
fo_FO.zip | 18.6 MB | 485.7 KB |
fr_FR-1990_1-3-2.zip | 14 MB | 636.4 KB |
fr_FR-classique_1-3-2.zip | 14 MB | 743.1 KB |
fr_FR_1-3-2.zip | 14.5 MB | 755.2 KB |
fy_NL.zip | 4.2 MB | 272.8 KB |
ga_IE.zip | 14 MB | 674.8 KB |
gd_GB.zip | 2.7 MB | 111 KB |
gl_ES.zip | FAIL | 1.2 MB |
gsc_FR.zip | FAIL | 1.4 MB |
gu_IN.zip | 20.3 MB | 914.9 KB |
he_IL.zip | 53.3 MB | 1.8 MB |
hi_IN.zip | 2.7 MB | 136.9 KB |
hil_PH.zip | 3.4 MB | 164.8 KB |
hr_HR.zip | 29.7 MB | 564.8 KB |
hu_HU.zip | FAIL | 17.6 MB |
hu_HU_comb.zip | FAIL | 19.9 MB |
ia.zip | 4.9 MB | 211.9 KB |
id_ID.zip | 3.9 MB | 218.4 KB |
it_IT.zip | 15.3 MB | 1.6 MB |
ku_TR.zip | 1.6 MB | 147.6 KB |
la.zip | 5.1 MB | 2.5 MB |
lt_LT.zip | 15 MB | 2.8 MB |
lv_LV.zip | 36.3 MB | 1.9 MB |
mg_MG.zip | 2.9 MB | 131.7 KB |
mi_NZ.zip | FAIL | 171.2 KB |
mk_MK.zip | FAIL | 436.9 KB |
mos_BF.zip | 13.3 MB | 210 KB |
mr_IN.zip | FAIL | 115.5 KB |
ms_MY.zip | 4.1 MB | 221.6 KB |
nb_NO.zip | 22.9 MB | 1.4 MB |
ne_NP.zip | 5.5 MB | 495.6 KB |
nl_NL.zip | 22.9 MB | 1.1 MB |
nl_med.zip | 1.2 MB | 60.2 KB |
nn_NO.zip | 16.5 MB | 1 MB |
nr_ZA.zip | 3.1 MB | 171.1 KB |
ns_ZA.zip | 1.7 MB | 85.8 KB |
ny_MW.zip | FAIL | 69.6 KB |
oc_FR.zip | 9.1 MB | 690.5 KB |
pl_PL.zip | 43.9 MB | 4.9 MB |
pt_BR.zip | FAIL | 3.9 MB |
pt_PT.zip | 5.8 MB | 773.4 KB |
ro_RO.zip | 5.1 MB | 226.2 KB |
ru_RU.zip | 21.7 MB | 1.4 MB |
ru_RU_ye.zip | 43.7 MB | 1.6 MB |
ru_RU_yo.zip | 21.7 MB | 1.4 MB |
rw_RW.zip | 1.6 MB | 70.1 KB |
sk_SK.zip | 25.1 MB | 2.3 MB |
sl_SI.zip | 38.3 MB | 806.6 KB |
sq_AL.zip | 28.9 MB | 654.6 KB |
ss_ZA.zip | 3.1 MB | 176.3 KB |
st_ZA.zip | 1.7 MB | 86.5 KB |
sv_SE.zip | 9.5 MB | 668.8 KB |
sw_KE.zip | 6.3 MB | 286 KB |
tet_ID.zip | 2 MB | 92.4 KB |
th_TH.zip | FAIL | 377.4 KB |
tl_PH.zip | 2.6 MB | 116.5 KB |
tn_ZA.zip | 1.5 MB | 61.6 KB |
ts_ZA.zip | 1.6 MB | 81 KB |
uk_UA.zip | 17.6 MB | 3 MB |
ve_ZA.zip | FAIL | 108.8 KB |
vi_VN.zip | 1.7 MB | 53.6 KB |
xh_ZA.zip | 3 MB | 158.9 KB |
zu_ZA.zip | 24.5 MB | 13.5 MB |
ASF subversion and git services (migrated from JIRA)
Commit 1571321 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571321
LUCENE-5468: factor OfflineSorter out of suggest
ASF subversion and git services (migrated from JIRA)
Commit 1571356 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571356
LUCENE-5468: sort dictionary data with offline sorter
ASF subversion and git services (migrated from JIRA)
Commit 1571788 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571788
LUCENE-5468: deduplicate patterns used by affix condition check
ASF subversion and git services (migrated from JIRA)
Commit 1571802 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571802
LUCENE-5468: remove redundant 'append' in Affix
ASF subversion and git services (migrated from JIRA)
Commit 1571807 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571807
LUCENE-5468: Stem -> CharsRef
ASF subversion and git services (migrated from JIRA)
Commit 1571844 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1571844
LUCENE-5468: make Affix fixed-width
ASF subversion and git services (migrated from JIRA)
Commit 1572643 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572643
LUCENE-5468: don't create unnecessary objects
ASF subversion and git services (migrated from JIRA)
Commit 1572660 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572660
LUCENE-5468: encode affix data as 8 bytes per affix, before cutting over to FST
ASF subversion and git services (migrated from JIRA)
Commit 1572666 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572666
LUCENE-5468: convert affixes to FST
Robert Muir (@rmuir) (migrated from JIRA)
I am finished compressing for now. I think it's pretty reasonable across all the languages.
I will clean up and try to add back the multiple dictionary/ignore case stuff and tidy up some other things.
dict | old RAM | new RAM |
---|---|---|
af_ZA.zip | 18 MB | 917.1 KB |
ak_GH.zip | 1.5 MB | 103.2 KB |
bg_BG.zip | FAIL | 465.7 KB |
ca_ANY.zip | 28.9 MB | 675.4 KB |
ca_ES.zip | 15.1 MB | 639.8 KB |
cop_EG.zip | 2.1 MB | 144.5 KB |
cs_CZ.zip | 50.4 MB | 1.5 MB |
cy_GB.zip | FAIL | 627.4 KB |
da_DK.zip | FAIL | 669.8 KB |
de_AT.zip | 1.3 MB | 123.9 KB |
de_CH.zip | 12.6 MB | 725.4 KB |
de_DE.zip | 12.6 MB | 726 KB |
de_DE_comb.zip | 102.2 MB | 4.2 MB |
de_DE_frami.zip | 20.9 MB | 1023.5 KB |
de_DE_neu.zip | 101.5 MB | 4.2 MB |
el_GR.zip | 74.3 MB | 1 MB |
en_AU.zip | 8.1 MB | 521 KB |
en_CA.zip | 9.8 MB | 450.5 KB |
en_GB-oed.zip | 8.2 MB | 526.6 KB |
en_GB.zip | 8.3 MB | 527.3 KB |
en_NZ.zip | 8.4 MB | 532.4 KB |
eo.zip | 4.9 MB | 310.5 KB |
eo_EO.zip | 4.9 MB | 310.5 KB |
es_AR.zip | 14.8 MB | 734.9 KB |
es_BO.zip | 14.8 MB | 735 KB |
es_CL.zip | 14.7 MB | 734.9 KB |
es_CO.zip | 14.3 MB | 722.1 KB |
es_CR.zip | 14.8 MB | 733.9 KB |
es_CU.zip | 14.7 MB | 732.8 KB |
es_DO.zip | 14.7 MB | 731.9 KB |
es_EC.zip | 14.8 MB | 733.5 KB |
es_ES.zip | 15.1 MB | 743 KB |
es_GT.zip | 14.8 MB | 734.5 KB |
es_HN.zip | 14.8 MB | 735.2 KB |
es_MX.zip | 14.3 MB | 723.8 KB |
es_NEW.zip | 15.5 MB | 768.5 KB |
es_NI.zip | 14.8 MB | 734.5 KB |
es_PA.zip | 14.8 MB | 733.8 KB |
es_PE.zip | 14.2 MB | 721.3 KB |
es_PR.zip | 14.7 MB | 732.4 KB |
es_PY.zip | 14.8 MB | 734.1 KB |
es_SV.zip | 14.8 MB | 733.6 KB |
es_UY.zip | 14.8 MB | 736.9 KB |
es_VE.zip | 14.3 MB | 722.7 KB |
et_EE.zip | 53.6 MB | 473.6 KB |
fo_FO.zip | 18.6 MB | 517.9 KB |
fr_FR-1990_1-3-2.zip | 14 MB | 526.7 KB |
fr_FR-classique_1-3-2.zip | 14 MB | 539.2 KB |
fr_FR_1-3-2.zip | 14.5 MB | 550.4 KB |
fy_NL.zip | 4.2 MB | 265.6 KB |
ga_IE.zip | 14 MB | 460.6 KB |
gd_GB.zip | 2.7 MB | 143.1 KB |
gl_ES.zip | FAIL | 479.4 KB |
gsc_FR.zip | FAIL | 1.3 MB |
gu_IN.zip | 20.3 MB | 947 KB |
he_IL.zip | 53.3 MB | 539.2 KB |
hi_IN.zip | 2.7 MB | 169 KB |
hil_PH.zip | 3.4 MB | 197 KB |
hr_HR.zip | 29.7 MB | 573 KB |
hu_HU.zip | FAIL | 1.2 MB |
hu_HU_comb.zip | FAIL | 5.4 MB |
ia.zip | 4.9 MB | 222.9 KB |
id_ID.zip | 3.9 MB | 226.3 KB |
it_IT.zip | 15.3 MB | 612.9 KB |
ku_TR.zip | 1.6 MB | 118.7 KB |
la.zip | 5.1 MB | 199.3 KB |
lt_LT.zip | 15 MB | 682.5 KB |
lv_LV.zip | 36.3 MB | 763.9 KB |
mg_MG.zip | 2.9 MB | 163.8 KB |
mi_NZ.zip | FAIL | 191.4 KB |
mk_MK.zip | FAIL | 469.1 KB |
mos_BF.zip | 13.3 MB | 242.2 KB |
mr_IN.zip | FAIL | 147.7 KB |
ms_MY.zip | 4.1 MB | 226.9 KB |
nb_NO.zip | 22.9 MB | 1.2 MB |
ne_NP.zip | 5.5 MB | 328.1 KB |
nl_NL.zip | 22.9 MB | 1.1 MB |
nl_med.zip | 1.2 MB | 92.3 KB |
nn_NO.zip | 16.5 MB | 914 KB |
nr_ZA.zip | 3.1 MB | 203.3 KB |
ns_ZA.zip | 1.7 MB | 118 KB |
ny_MW.zip | FAIL | 101.8 KB |
oc_FR.zip | 9.1 MB | 401.5 KB |
pl_PL.zip | 43.9 MB | 1.7 MB |
pt_BR.zip | FAIL | 2.1 MB |
pt_PT.zip | 5.8 MB | 379.4 KB |
ro_RO.zip | 5.1 MB | 256.3 KB |
ru_RU.zip | 21.7 MB | 882 KB |
ru_RU_ye.zip | 43.7 MB | 1.5 MB |
ru_RU_yo.zip | 21.7 MB | 897.3 KB |
rw_RW.zip | 1.6 MB | 102.3 KB |
sk_SK.zip | 25.1 MB | 1.2 MB |
sl_SI.zip | 38.3 MB | 604 KB |
sq_AL.zip | 28.9 MB | 581.7 KB |
ss_ZA.zip | 3.1 MB | 208.5 KB |
st_ZA.zip | 1.7 MB | 118.7 KB |
sv_SE.zip | 9.5 MB | 535.4 KB |
sw_KE.zip | 6.3 MB | 318.2 KB |
tet_ID.zip | 2 MB | 124.5 KB |
th_TH.zip | FAIL | 409.6 KB |
tl_PH.zip | 2.6 MB | 148.7 KB |
tn_ZA.zip | 1.5 MB | 93.7 KB |
ts_ZA.zip | 1.6 MB | 113.1 KB |
uk_UA.zip | 17.6 MB | 979.1 KB |
ve_ZA.zip | FAIL | 140.9 KB |
vi_VN.zip | 1.7 MB | 85.8 KB |
xh_ZA.zip | 3 MB | 191.1 KB |
zu_ZA.zip | 24.5 MB | 827.1 KB |
Chris Male (migrated from JIRA)
Those are some pretty amazing reductions, well done!
Robert Muir (@rmuir) (migrated from JIRA)
I have the previous options added back too locally. so i will fix up tests and so on and just copy over the old filter and make a patch.
ASF subversion and git services (migrated from JIRA)
Commit 1572718 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572718
LUCENE-5468: hunspell2 -> hunspell (with previous options and tests)
ASF subversion and git services (migrated from JIRA)
Commit 1572724 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572724
LUCENE-5468: fix precommit+test
ASF subversion and git services (migrated from JIRA)
Commit 1572727 from @rmuir in branch 'dev/branches/lucene5468' https://svn.apache.org/r1572727
LUCENE-5468: add additional change
Robert Muir (@rmuir) (migrated from JIRA)
I think the change is ready. There are other improvements that can be done (for example, maybe an option for the factory to cache these things in case you use same ones across multiple fields, and more efficient affix handling against the FST, and so on), but it would be better on different issues I think?
Here is a patch (from diff-sources); sorry it's not so useful, as I renamed some things. I tried making one from svn diff after reintegration, but it was equally useless. If you want you can also review my commits on this issue to the branch, too.
here is CHANGES entry:
API Changes:
Optimizations:
Chris Male (migrated from JIRA)
Is the longestOnly option a standard Hunspell thing? (more a question of general interest)
Hunspell stemmer requires gigantic (for the task) amounts of memory to load dictionary/rules files. For example, loading a 4.5 MB Polish dictionary (with an empty index!) will cause the whole core to crash with various out-of-memory errors unless you set the max heap size close to 2GB or more. By comparison Stempel using the same dictionary file works just fine with 1/8 of that (and possibly lower values as well).
Sample error log entries: http://pastebin.com/fSrdd5W1 http://pastebin.com/Lmi0re7Z
Migrated from LUCENE-5468 by Maciej Lisiewski, 1 vote, resolved Feb 27 2014 Attachments: LUCENE-5468.patch, patch.txt