dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Phonetic ngrams in POS ngram DFE can't be turned off? Crash on some characters #133

Closed daxenberger closed 9 years ago

daxenberger commented 9 years ago

Originally reported on Google Code with ID 133

What steps will reproduce the problem?
1. Run the Lucene POS ngram DFE on Twitter data
2. Get a Soundex exception saying "character cannot be phonetized"
3. The phonetization seems to be set in the POS meta collector and cannot be turned off
from the FE parameters in Groovy

What is the expected output? What do you see instead?

These two things (POS ngrams and phonetic ngrams) should definitely be separable. I
am confused by this revision: is it intended to work this way? If so, why?

  public class LucenePOSNGramMetaCollector
      extends LuceneBasedMetaCollector
  {
-    @ConfigurationParameter(name = LucenePOSNGramFeatureExtractorBase.PARAM_POS_NGRAM_MIN_N, mandatory = true, defaultValue = "1")
+    @ConfigurationParameter(name = LucenePhoneticNGramFeatureExtractorBase.PARAM_PHONETIC_NGRAM_MIN_N, mandatory = true, defaultValue = "1")
      private int posNgramMinN;

Reported by l.flekova on 2014-05-23 21:13:27

daxenberger commented 9 years ago
You are right to be confused.
A bad case of copy and paste gone wrong.
Thanks for spotting that.
I will fix it so that the POS ngram DFE works as expected again.

The issue with "character cannot be phonetized" should be fixed as well, however.
Do you have an example that causes this exception?

Reported by torsten.zesch on 2014-05-24 09:03:58

daxenberger commented 9 years ago
Thx. Most likely it was either #, @ or a smiley, given that it was reading tweets. I'll
check that out once my current pipeline finishes.

Reported by l.flekova on 2014-05-24 09:51:58

daxenberger commented 9 years ago
I tried with a bunch of special characters and they all worked fine.

So it would be good to have the bad string in order to help me reproduce the problem.

Reported by torsten.zesch on 2014-05-24 18:46:48

daxenberger commented 9 years ago
Okay, so it is this character: ʉ in this sequence: �ʉ�_ which happens to be present
in some of the hyperlinks. Probably I messed up some escaped character sequence in
the data, so the error is between the chair and the laptop ;) For normal characters
it should be fairly failsafe :-) 

Caused by: java.lang.IllegalArgumentException: The character is not mapped: Ʉ
    at org.apache.commons.codec.language.Soundex.map(Soundex.java:226)
    at org.apache.commons.codec.language.Soundex.getMappingCode(Soundex.java:180)
    at org.apache.commons.codec.language.Soundex.soundex(Soundex.java:264)
    at org.apache.commons.codec.language.Soundex.encode(Soundex.java:162)
    at de.tudarmstadt.ukp.dkpro.tc.features.ngram.util.NGramUtils.getDocumentPhoneticNgrams(NGramUtils.java:167)
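
One way to guard against this exception is to skip tokens that contain letters outside A-Z before they reach the phonetic encoder. The following is only a minimal sketch, not the DKPro TC fix: `PhoneticGuard`, `isPhonetizable` and `keepPhonetizable` are hypothetical names, and the sketch assumes that the encoder ignores non-letters (which would match the observation that #, @ and smileys caused no error) and maps only the letters A-Z after upper-casing (which would match ʉ showing up as Ʉ in the message above).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DKPro TC or Commons Codec. */
final class PhoneticGuard {

    private PhoneticGuard() {
    }

    /**
     * Returns true if the token contains at least one letter and every
     * letter upper-cases to A-Z. Non-letters (#, @, digits, ...) are
     * skipped, on the assumption that the encoder drops them anyway.
     */
    static boolean isPhonetizable(String token) {
        boolean sawLetter = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetter(c)) {
                continue; // assumed to be stripped by the encoder
            }
            char upper = Character.toUpperCase(c);
            if (upper < 'A' || upper > 'Z') {
                return false; // e.g. U+0289, which upper-cases outside A-Z
            }
            sawLetter = true;
        }
        return sawLetter;
    }

    /** Keeps only the tokens that are considered safe to phonetize. */
    static List<String> keepPhonetizable(List<String> tokens) {
        List<String> safe = new ArrayList<>();
        for (String token : tokens) {
            if (isPhonetizable(token)) {
                safe.add(token);
            }
        }
        return safe;
    }
}
```

The check is deliberately conservative: it also rejects tokens without any letters and letters such as ß that a string-level upper-casing might still turn into A-Z. Filtering would happen before the ngrams are built, so a single odd token in a hyperlink would no longer abort the whole task.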

It happens in the MetaTask which 

Reported by l.flekova on 2014-05-24 20:06:34

daxenberger commented 9 years ago
I tested with those characters and I got no mapping errors here.

So we will leave that issue closed until someone runs into the same problem again :)

Reported by torsten.zesch on 2014-05-24 20:10:13

daxenberger commented 9 years ago

Reported by daxenberger.j on 2014-06-13 15:18:04