gchrupala / morfette

Supervised learning of morphology
BSD 2-Clause "Simplified" License

some words don't get a lemma at all #19

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. train morfette on half the Lemmatized Penn Treebank
2. morfetize the following sentence

Fans    NNS
of      IN
Anthony NNP
Andrews NNP
-LRB-   -LRB-
Brideshead      NNP
Revisited       NNP
-RRB-   -RRB-
will    MD
relish  VB
watching        VBG
him     PRP
play    VB
the     DT
title   NN
role    NN
-LRB-   -LRB-
s       PRP
-RRB-   -RRB-
in      IN
the     DT
19th-century    NN
Robert  NNP
Louis   NNP
Stevenson       NNP
pre-Freudian    JJ
drama   NN
of      IN
schizoid        JJ
horror  NN
.       .
3. the output will be as follows (note the line "s  NNS", which has no lemma)
Fans fan NNS
of of IN
Anthony anthony NNP
Andrews andrews NNP
-LRB- -lrb- -LRB-
Brideshead brideshead NNP
Revisited revisited NNP
-RRB- -rrb- -RRB-
will will MD
relish relish VB
watching watch VBG
him he PRP
play play VB
the the DT
title title NN
role role NN
-LRB- -lrb- -LRB-
s  NNS
-RRB- -rrb- -RRB-
in in IN
the the DT
19th-century 19th-century NN
Robert robert NNP
Louis louis NNP
Stevenson stevenson NNP
pre-Freudian pre-freudian JJ
drama drama NN
of of IN
schizoid schizoid JJ
horror horror NN
. . .

What is the expected output? What do you see instead?

Morfette should check that every output line has 3 columns and, if not, display 
a warning and fall back to copying the word form as the lemma when none can be 
found.
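
The check requested above can be sketched as a post-processing step over the output. This is only an illustration in Python, not part of Morfette: the names repair_line and repair_output are hypothetical, and it assumes the columns are whitespace-separated and that tokens contain no spaces, as in the sample output.

```python
import sys

def repair_line(line):
    """Return (fixed_line, was_repaired) for one line of token/lemma/tag output."""
    fields = line.split()
    # Blank lines and well-formed 3-column lines pass through unchanged.
    if not fields or len(fields) == 3:
        return line.rstrip("\n"), False
    # Two columns means the lemma is missing: copy the word form (assumed fallback).
    if len(fields) == 2:
        token, tag = fields
        return f"{token} {token} {tag}", True
    raise ValueError(f"unexpected number of columns: {line!r}")

def repair_output(lines, warn=sys.stderr):
    """Yield repaired lines, writing one warning per line that lacked a lemma."""
    for number, line in enumerate(lines, start=1):
        fixed, repaired = repair_line(line)
        if repaired:
            print(f"warning: line {number}: missing lemma, using word form", file=warn)
        yield fixed
```

With this sketch, the problematic line "s  NNS" becomes "s s NNS" and a warning naming its line number is written to stderr.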

Original issue reported on code.google.com by djame.seddah@gmail.com on 15 Dec 2011 at 11:16

GoogleCodeExporter commented 9 years ago
I think the problem must be that the resulting lemma is the empty string (after 
removing the "s" suffix). The solution would be to filter out empty strings as 
candidate lemmas. Can you upload the model somewhere so I can try to fix this?
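
A minimal Python sketch of this diagnosis, assuming candidate lemmas are produced by stripping a suffix from the lowercased word form. It is not Morfette's actual code: strip_suffix and candidate_lemmas are hypothetical names, and both the lowercasing and the fallback to the word form are assumptions modelled on the example output and on the request in the report.

```python
def strip_suffix(word, suffix):
    """Candidate lemma obtained by removing `suffix`, or None if it does not apply."""
    if suffix and word.endswith(suffix):
        return word[: -len(suffix)]
    return None

def candidate_lemmas(word, suffixes):
    """Apply each suffix rule, dropping rules that do not apply or leave nothing."""
    lowered = word.lower()
    candidates = []
    for suffix in suffixes:
        lemma = strip_suffix(lowered, suffix)
        if lemma:  # rejects both None and the empty string
            candidates.append(lemma)
    # If every candidate was rejected, fall back to the word form itself.
    return candidates or [lowered]
```

For the token "s", stripping the suffix "s" leaves the empty string, which is rejected, so the word form itself is returned instead of an empty lemma.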

Original comment by pitekus on 15 Dec 2011 at 12:35

GoogleCodeExporter commented 9 years ago
I have checked in code to reject empty lemma candidates.
Djame, if you upload your model somewhere for me, or verify yourself that it 
works, I can prepare a new release with the few recent bugfixes, including this 
one.

Original comment by pitekus on 15 Dec 2011 at 1:40

GoogleCodeExporter commented 9 years ago
Hi Grzegorz,
I've uploaded the PTB model here:

http://pauillac.inria.fr/~seddah/ptb+lemma+new.10x3.model.half.tar.bz2

Do you need some test text, or is the one I put in the bug description enough?

Original comment by djame.seddah@gmail.com on 15 Dec 2011 at 2:26

GoogleCodeExporter commented 9 years ago
OK, I tried your model, but I get a deserialization error. I'm not sure why: 
perhaps the version of the serialization package (binary) I have installed is 
not compatible with yours. Could you check out the code from SVN, recompile, 
and try your example?

Original comment by pitekus on 15 Dec 2011 at 2:44

GoogleCodeExporter commented 9 years ago
I assume this is working now. Setting status to fixed.

Original comment by pitekus on 24 Jan 2012 at 12:14