dhowe / RiTaV1

RiTa: the generative language toolkit
http://rednoise.org/rita
GNU General Public License v3.0

RiTa.getPosTags() always returns [nn] or [nns] for words not found in the dictionary rita_dict.js. #181

Closed CyrusSUEN closed 8 years ago

CyrusSUEN commented 9 years ago

testPosTagging() in KnownIssuesTest.java


  @Test
  public void testPosTagging()
  {
    String[] result = RiTa.getPosTags("a fucking fool", false);
    System.out.println(Arrays.asList(result)); // [dt, nn, nn]
    String[] answer = new String[] { "dt", "jj", "nn" }; // one tag per word
    // deepEqual(result, answer);

    result = RiTa.getPosTags("shitting", false);
    System.out.println(Arrays.asList(result)); // [nn]
    answer = new String[] { "jj" };
    // deepEqual(result, answer);

    result = RiTa.getPosTags("shitty", false);
    System.out.println(Arrays.asList(result)); // [nn]
    answer = new String[] { "jj" };
    // deepEqual(result, answer);

    result = RiTa.getPosTags("shitty", true);
    System.out.println(Arrays.asList(result)); // [n]
    answer = new String[] { "a" };
    deepEqual(result, answer);
  }

The problem lies in line 204 of PrillPosTagger.java, where any word missing from the dictionary falls back to a noun tag:

  if (data == null || data.length == 0) {
    //choices[i] = word.endsWith("s") ? NOUNP : NOUN;
    result[i] = word.endsWith("s") ? "nns" : "nn";
  }
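
To make the effect of that branch concrete, here is a minimal, self-contained sketch of the same fallback rule. The class name UnknownWordFallbackDemo, the tag() method, and the three-entry dictionary are made up for illustration and are not RiTa's actual code; only the endsWith("s") rule is taken from the snippet above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the tagger's dictionary lookup; not RiTa's real code.
class UnknownWordFallbackDemo {

  // Tiny made-up dictionary: word -> possible tags, most likely first.
  static final Map<String, String[]> DICT = new HashMap<>();
  static {
    DICT.put("a", new String[] { "dt" });
    DICT.put("fool", new String[] { "nn", "vb" });
    DICT.put("happy", new String[] { "jj" });
  }

  static String[] tag(String[] words) {
    String[] result = new String[words.length];
    for (int i = 0; i < words.length; i++) {
      String word = words[i];
      String[] data = DICT.get(word.toLowerCase());
      if (data == null || data.length == 0) {
        // Same rule as the quoted branch: any unknown word becomes a noun.
        result[i] = word.endsWith("s") ? "nns" : "nn";
      } else {
        result[i] = data[0];
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // "quirky" is not in the made-up dictionary, so it is tagged as a noun.
    System.out.println(Arrays.toString(tag(new String[] { "a", "quirky", "fool" })));
    // prints [dt, nn, nn]
  }
}
```

With a dictionary this small almost every adjective or verb form ends up as nn or nns, which is the same symptom as in the test above, only exaggerated.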

Reference for correct results: http://nlp.stanford.edu:8080/parser/index.jsp

a/DT fucking/VBG fool/NN

dhowe commented 9 years ago

Do you have some other suggestions?

CyrusSUEN commented 9 years ago

A bigger data file. The English tagger files from the Stanford POS tagger range from 12 to 15 MB in size.

Or run the Stanford POS tagger as a server and send it a fallback request for words missing from the dictionary: http://nlp.stanford.edu/software/pos-tagger-faq.shtml#e
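
The fallback idea could be wired in roughly as follows. This is only a sketch under stated assumptions: FallbackTaggingDemo, tagWithFallback(), and the Function<String, String> fallback are hypothetical names, and the lambda in main() is a stub standing in for a real request to an external tagger, whose client protocol is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in RiTa.
class FallbackTaggingDemo {

  // Looks each word up locally, asks the fallback only for unknown words,
  // and keeps the nn/nns guess if the fallback has no answer either.
  static String[] tagWithFallback(String[] words, Map<String, String> dict,
      Function<String, String> fallback) {
    String[] result = new String[words.length];
    for (int i = 0; i < words.length; i++) {
      String word = words[i];
      String tag = dict.get(word.toLowerCase());
      if (tag == null) tag = fallback.apply(word);
      if (tag == null) tag = word.endsWith("s") ? "nns" : "nn";
      result[i] = tag;
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> dict = new HashMap<>();
    dict.put("a", "dt");
    dict.put("fool", "nn");
    // Stub fallback: a real one would query the external tagger instead.
    Function<String, String> stub = w -> w.endsWith("ing") ? "vbg" : null;
    System.out.println(Arrays.toString(
        tagWithFallback(new String[] { "a", "blathering", "fool" }, dict, stub)));
    // prints [dt, vbg, nn]
  }
}
```

Asking the fallback one word at a time throws away sentence context, so a real integration would probably send the whole phrase and read back only the tags for the unknown words.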

dhowe commented 8 years ago

closing for now