apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

It is impossible to use a custom dictionary for SmartChineseAnalyzer [LUCENE-1817] #2892

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

It is not possible to use a custom dictionary, even though there is a lot of code and javadoc meant to allow this.

This is because the custom dictionary is only loaded if the built-in one fails to load (and the built-in one is, of course, in the jar file and should always load).

public synchronized static WordDictionary getInstance() {
  if (singleInstance == null) {
    singleInstance = new WordDictionary(); // load from jar file
    try {
      singleInstance.load();
    } catch (IOException e) {
      // loading from jar file must fail before it checks the AnalyzerProfile (where this can be configured)
      String wordDictRoot = AnalyzerProfile.ANALYSIS_DATA_DIR;
      singleInstance.load(wordDictRoot);
    } catch (ClassNotFoundException e) {
      throw new RuntimeException(e);
    }
  }
  return singleInstance;
}

I think we should either correct this, document this, or disable custom dictionary support...
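For illustration only, here is a minimal sketch of what "correcting this" could look like: consult the configured directory first and fall back to the bundled copy only when no usable directory is configured. The class DictionaryLookupSketch, the Source enum, and resolve() are hypothetical names invented for this sketch. They are not part of smartcn, and this is not the fix that was committed.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch, not smartcn code: all names here are illustrative.
final class DictionaryLookupSketch {

    /** Where a dictionary would be read from. */
    enum Source { CUSTOM_DIRECTORY, BUNDLED_JAR }

    /**
     * Prefers the configured directory when it is set and exists on disk;
     * otherwise falls back to the copy bundled in the jar.
     */
    static Source resolve(String configuredDir) {
        if (configuredDir != null
                && !configuredDir.isEmpty()
                && Files.isDirectory(Paths.get(configuredDir))) {
            return Source.CUSTOM_DIRECTORY;
        }
        return Source.BUNDLED_JAR;
    }

    public static void main(String[] args) {
        // The system temp directory stands in for a configured data directory.
        System.out.println(resolve(System.getProperty("java.io.tmpdir")));
        // No directory configured: the bundled dictionary is used.
        System.out.println(resolve(null));
    }
}
```

The point of the ordering is that a missing or unset directory is the normal case and leads to the bundled dictionary, instead of the custom path being reachable only through an IOException.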


Migrated from LUCENE-1817 by Robert Muir (@rmuir), resolved Aug 27 2009 Attachments: dataFiles.zip, LUCENE-1817.patch (versions: 2), LUCENE-1817-mark-cn-experimental.patch

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I am looking at this today. One thing about this code that should also be corrected ASAP is that if you have a custom dictionary directory in .DCT format, the load() method will actually call save().

This will create a corresponding .MEM file in the same directory after loading the dictionary in DCT format.

I really do not think load() methods should be creating or writing to files.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In my opinion, the loader should be able to load either .mem files (which should really be named *.ser, because they are serialized Java objects) or DCT format files, either by autodetecting the format or through two separate methods. If you want to load the files more quickly later, you could also save the DCT as a serialized object after that, but this should be left to the user and not done automatically.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, I agree. Currently it does do the autodetect (it first checks for .MEM, then falls back on DCT). But if it has to fall back on DCT, it will create a .MEM file.
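A rough sketch of the side-effect-free autodetection discussed above: check for the serialized file first, fall back to DCT, and never write anything as part of detection. The class name, the Format enum, the base name "mydict", and the lowercase ".mem"/".dct" extensions are assumptions made for this example, not the actual smartcn file layout or API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch, not smartcn code: names and file extensions are assumed.
final class DictionaryFormatSketch {

    enum Format { SERIALIZED_MEM, DCT }

    /**
     * Reports which dictionary file is available for baseName in dir,
     * preferring the serialized form. This method only reads file metadata;
     * it never creates or modifies files.
     */
    static Optional<Format> detect(Path dir, String baseName) {
        if (Files.isRegularFile(dir.resolve(baseName + ".mem"))) {
            return Optional.of(Format.SERIALIZED_MEM);
        }
        if (Files.isRegularFile(dir.resolve(baseName + ".dct"))) {
            return Optional.of(Format.DCT);
        }
        return Optional.empty();
    }
}
```

Writing a serialized copy for faster loading would then be a separate, explicit call made by the user, as Uwe suggests, and not something detection or loading does on its own.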

asfimport commented 15 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

We should mark the smartcn module experimental as we plan to do heavy refactoring after 2.9 is out. This patch adds a notice to package.html and JavaDoc. Quoting Mark Miller from the list:

Warning users that you don't plan on promising back compat with experimental warnings seems like a good idea to me.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

to make matters more complex, trying to load a bigram dictionary from a DCT file gave me:

# An unexpected error has been detected by Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000006dc378d0, pid=3140, tid=5912
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.2-b01 mixed mode windows-amd64)
# Problematic frame:
# V  [jvm.dll+0x3a78d0]

Apparently this is some Clover issue in my Eclipse; I turned it off, so it is an unrelated problem.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch adds:

the patch requires some binary dct data files which I will try to upload as a zip

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

the two files in this directory need to be placed in smartcn/test under o/a/l/analysis/cn/smart/hmm/customDictionaryDCT

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

i looked at this file format and I am going to create smaller custom dictionaries for testing.

this way we do not have huge files in svn

asfimport commented 15 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Robert, I committed the javadoc changes. Once you have smaller dict files, feel free to commit your patch. If you run into problems, I would prefer to skip the tests (and the dict files) and commit it without this simple test. This should be fine.

simon

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is a javadocs-only patch that I think is the best solution.

This is because I created several custom dictionaries and found:

1) it will be difficult to support this dictionary format for a number of reasons
2) the dictionary format is limited to GB2312 encoding, and will not support things like Traditional Chinese
3) even when creating a correct file in the correct format, there are many assumptions about what should be in the dictionary, especially in things like WordDictionary.expandDelimiterData. If these assumptions are not met, things like infinite loops occur.
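The GB2312 limitation in point 2 can be seen with the JDK's own charset support, independent of smartcn. The sketch below asks a GB2312 encoder whether it can represent a simplified character and its traditional counterpart. The class and method names are made up for this example, and it assumes the running JDK ships the GB2312 charset (full JDK distributions normally do, minimal runtimes may not).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustrative sketch only; not part of smartcn.
final class Gb2312CoverageSketch {

    /** Returns true if every character of text can be encoded in GB2312. */
    static boolean encodable(String text) {
        // Throws UnsupportedCharsetException if the JDK lacks GB2312.
        CharsetEncoder encoder = Charset.forName("GB2312").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String simplified = "\u8bed";  // simplified form of "language"
        String traditional = "\u8a9e"; // traditional form of the same word
        System.out.println("simplified encodable:  " + encodable(simplified));
        System.out.println("traditional encodable: " + encodable(traditional));
    }
}
```

GB2312 covers simplified characters, so a traditional-only form such as the second one has no byte sequence in that encoding and could not be stored in a dictionary file restricted to it.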

I recommend we instead remove the javadocs describing how to use a custom dictionary, and in this patch also expand the EXPERIMENTAL wording from just APIs to both APIs and file formats. In the future we should refactor and use a Unicode-based format.

I won't do anything here without some consensus that others feel it is the right way to go, but I think we should do this in 2.9

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I agree, Robert. Given your concerns, let's drop custom support for now (even if just at the javadoc level, if you can't do custom anyway without rebuilding the jar).

+1

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

i will wait a bit and see if anyone has an issue with this, otherwise i would like to commit at the end of the day.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I vote commit it now so it makes the RC. I can appreciate wanting to have consensus here, but silence is consensus in Lucene dev, and two's often a crowd. By the powers vested in me as the RM (which are, essentially, none) I say pop this baby in. People have a week to complain and force us to take it out. I think this one is fairly clear territory though. Let's put the first RC out with everything we know of taken care of. These are extraordinary times.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed revision 808570.

asfimport commented 12 years ago

xlzhang (migrated from JIRA)

I am interested in working on a feature that allows using a customized dictionary stored in a text file rather than a DCT file. I have a couple of questions before trying it.

In the package, I only saw the .mem file. Where can I download the .dct file, and how do I convert a text file to a dct file?