clab / lstm-parser

Transition-based dependency parser based on stack LSTMs
Apache License 2.0

Error when parsing multiword expressions in conllu file #26

Closed sb-b closed 6 years ago

sb-b commented 6 years ago

Hi,

I am trying to train this parser on the Turkish UD Treebank. When I run this command:

java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt

I got the following error:

java.lang.NumberFormatException: For input string: "2-3"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at arc_std_swap.Oracle.getTransition(Oracle.java:41)
        at arc_std_swap.Parser.printOracle(Parser.java:366)
        at arc_std_swap.Parser.main(Parser.java:270)

The CoNLL-U sentence on which the lstm-parser oracle fails is the one below:

# sent_id = mst-0003
# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.
1   Sanal   sanal   ADJ Adj _   2   amod    _   _
2-3 parçacıklarsa   _   _   _   _   _   _   _   _
2   parçacıklar parçacık    NOUN    Noun    Case=Nom|Number=Plur|Person=3   6   csubj   _   _
3   sa  i   AUX Zero    Aspect=Perf|Mood=Cnd|Number=Sing|Person=3|Tense=Pres    2   cop _   _
4   bunların    bu  PRON    Demons  Case=Gen|Number=Plur|Person=3|PronType=Dem  5   nmod:poss   _   _
5   hiçbirini   hiçbiri PRON    Quant   Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|PronType=Ind 6   obj _   _
6   yapamazlar  yap VERB    Verb    Aspect=Imp|Mood=Pot|Number=Plur|Person=3|Polarity=Neg|Tense=Aor 0   root    _   SpaceAfter=No
7   .   .   PUNCT   Punc    _   6   punct   _   _

The word 'parçacıklarsa' is a multiword token, so it is numbered '2-3'. Does lstm-parser have a mechanism for dealing with multiword tokens? How can I solve this issue?

Thanks,

Betul

miguelballesteros commented 6 years ago

Hi! This is CoNLL-U format; the parser only handles CoNLL format. Please see the Universal Dependencies scripts.

Miguel

sb-b commented 6 years ago

Hi,

I couldn't find an appropriate script for converting CoNLL-U files to CoNLL files. I would be grateful if you could suggest a script for this task.

Thanks,

Betul

miguelballesteros commented 6 years ago

I believe this is the one: https://github.com/UniversalDependencies/tools/blob/f21108176ff431ebbab4c9414d6e0345e62d3995/conllu_to_conllx.pl
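
For readers who want to see what the oracle is tripping over, here is a minimal Python sketch. It is not a reimplementation of the linked Perl script, which may do more (for example, adjusting columns). It only assumes that the oracle needs every token ID to be a plain integer, as the NumberFormatException on "2-3" suggests, and that the target format has no '#' comment lines. The function name strip_conllu_extras is hypothetical.

```python
import re

# In CoNLL-U, ordinary words have integer IDs ("2"), multiword tokens have
# ranges ("2-3"), and empty nodes have decimals ("8.1"). Only the first kind
# can be parsed as an integer.
WORD_ID = re.compile(r"[0-9]+")


def strip_conllu_extras(lines):
    """Yield word lines and blank sentence separators; drop everything else.

    Hypothetical helper: drops '#' comments, multiword-token range lines,
    and empty-node lines. Columns are passed through unchanged.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            yield ""  # blank line marks a sentence boundary
            continue
        if text.startswith("#"):
            continue  # e.g. "# sent_id = mst-0003"
        token_id = text.split(None, 1)[0]  # first whitespace-separated field
        if WORD_ID.fullmatch(token_id):
            yield text


sample = [
    "# sent_id = mst-0003\n",
    "1\tSanal\tsanal\tADJ\n",
    "2-3\tparçacıklarsa\t_\t_\n",
    "2\tparçacıklar\tparçacık\tNOUN\n",
    "3\tsa\ti\tAUX\n",
    "\n",
]
cleaned = list(strip_conllu_extras(sample))
for row in cleaned:
    print(row)
```

The surface form 'parçacıklarsa' is lost in this approach; only its syntactic words 'parçacıklar' and 'sa' remain, and those are the units that carry heads and dependency labels.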

sb-b commented 6 years ago

It worked, thank you!
