Lucene writeToIndex fails if no ngrams in one doc

ashokpant / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Other

0 stars 0 forks source link

Lucene writeToIndex fails if no ngrams in one doc #141

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

LuceneBasedMetaCollector.writeToIndex() throws an exception in the circumstance 
where (as best I can determine) a lucene Field contains no Value for a 
particular document.  This can happen when restrictions on permissible ngrams 
in an Ngram FE causes a particular text to produce no usable ngram features 
during meta collection.

Preferred behavior would be to produce a warning, or require a user override 
("Yes, I know some docs will produce 0 ngrams.")

I found this bug by using KeywordComboNGramPairFeatureExtractor with a short 
list of rare keywords.

I am attaching the stack trace of my exception.

Original issue reported on code.google.com by EmilyKJa...@gmail.com on 9 Jun 2014 at 1:45

Attachments:

lucene-empty-ngrams-exception.txt

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r873.

Original comment by torsten....@gmail.com on 9 Jun 2014 at 2:16

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

r873 was helpful, but since it doesn't technically solve this Issue's problem, 
I am reopening the issue until a solution is committed.

This original issue concerned no usable ngrams (not empty text), which should 
probably be solved in all affected MetaCollectors by first collecting the 
ngrams (in this case, charNGrams):
    public void process(JCas jcas)
        throws AnalysisEngineProcessException
    {
        FrequencyDistribution<String> charNGrams = NGramUtils.getCharacterSkipNgrams(
                jcas, ngramLowerCase, minN, maxN, skipSize);

and then, if the ngrams fd isn't empty, creating the lucene document:

        initializeDocument(jcas);

Original comment by EmilyKJa...@gmail.com on 9 Jun 2014 at 3:33

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Actually, it looks like process() should be refactored away; no reason to 
implement this fix 8 times, when the only thing that changes is the contents of 
the fd and the Field name.

Original comment by EmilyKJa...@gmail.com on 9 Jun 2014 at 3:46

GoogleCodeExporter commented 9 years ago

First, process() cannot be "refactored away" as this is exactly where the 
functionality of the different lucene-based meta collectors are implemented.

Second, could you please describe a bit more clearly why the fix doesn't solve 
the problem?

Original comment by torsten....@gmail.com on 9 Jun 2014 at 4:05

GoogleCodeExporter commented 9 years ago

1. Since the only things that change between the (single doc) metacollectors 
are the contents of a single fd and also the Field name, it looks reasonable to 
put process() into LuceneBasedMetaCollector and have this process() call 2 
abstract methods that fetch the contents of the fd and the Field name.  
Benefit: Instantiating the lucene doc would be hidden from the user, less 
exposed for copy-paste errors, and there would be less code duplication.  
Drawback: If someone needed a more customized lucene ngram metacollector, they 
would have to override the (proposed) LuceneBasedMetaCollector.process() and 
implement their own version.  Also, there would need to be a 
LuceneBasedPairmetaCollector for an equivalent for the pairs metacollectors
2. On second look, you're right, the solution is fine.  I mis-interpreted 
"document" in the commit notes.

Original comment by EmilyKJa...@gmail.com on 9 Jun 2014 at 4:31

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Regarding 1)

If you think that this makes things clearer, please go ahead with the 
refactoring.

Original comment by torsten....@gmail.com on 9 Jun 2014 at 4:47

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 13 Jun 2014 at 3:18

Added labels: Release-0.6.0