JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

DependencyParserApproach throws "IllegalArgumentException: For input string: "_"" when training with CONLLU dataset #14214

Status: Closed (Arierref46 closed this issue 3 months ago)

Arierref46 commented 4 months ago

What are you working on?

I have been trying to train a DependencyParserApproach(), but training fails with this error: IllegalArgumentException: For input string: "_". I don't know why. I am using the public Bosque treebank as training data, specifically the file "pt_bosque-ud-train.conllu" (https://github.com/UniversalDependencies/UD_Portuguese-Bosque).

Current Behavior

The DependencyParserApproach() throws the error when the .fit() function is called.

Expected Behavior

The DependencyParserApproach() should train normally.

Steps To Reproduce

https://colab.research.google.com/drive/1wyyJfdNSfm0C-r7-h2ri0Xrmzyw6nzgT?usp=sharing

Spark NLP version and Apache Spark

Spark 3.3.1, Spark NLP 5.3.2

Type of Spark Application

Python Application

Java Version

openjdk version "1.8.0_402"

Operating System and Version

Windows 11

maziyarpanahi commented 4 months ago

I share some links here just in case

I am not sure about that data type, but I just tested a file that is like this:

# sent_id = weblog-juancole.com_juancole_20030911085700_ENG_20030911_085700-0022
# text = It should continue to be defanged.
1   It  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  3   nsubj   3:nsubj|6:nsubj:xsubj   _
2   should  should  AUX MD  VerbForm=Fin    3   aux 3:aux   _
3   continue    continue    VERB    VB  VerbForm=Inf    0   root    0:root  _
4   to  to  PART    TO  _   6   mark    6:mark  _
5   be  be  AUX VB  VerbForm=Inf    6   aux:pass    6:aux:pass  _
6   defanged    defange VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass 3   xcomp   3:xcomp SpaceAfter=No
7   .   .   PUNCT   .   _   3   punct   3:punct _

# sent_id = weblog-blogspot.com_healingiraq_20040409053012_ENG_20040409_053012-0015
# text = So what happened?
1   So  so  ADV RB  _   3   advmod  3:advmod    _
2   what    what    PRON    WP  PronType=Int    3   nsubj   3:nsubj _
3   happened    happen  VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    0   root    0:root  SpaceAfter=No
4   ?   ?   PUNCT   .   _   3   punct   3:punct _

# sent_id = weblog-typepad.com_ripples_20040407125600_ENG_20040407_125600-0055
# text = That too was stopped.
1   That    that    PRON    DT  Number=Sing|PronType=Dem    4   nsubj:pass  4:nsubj:pass    _
2   too too ADV RB  _   4   advmod  4:advmod    _
3   was be  AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   4   aux:pass    4:aux:pass  _
4   stopped stop    VERB    VBN Tense=Past|VerbForm=Part|Voice=Pass 0   root    0:root  SpaceAfter=No
5   .   .   PUNCT   .   _   4   punct   4:punct _

Arierref46 commented 4 months ago

Can you try with more data? For some reason, I can run it with 3 examples from the Bosque dataset, but when I add more examples it crashes. Also, could it be related to the data being written in Portuguese?
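Since training works with 3 sentences and fails with more, one way to narrow this down is to train on progressively larger prefixes of the file until the error appears. The helper below is a minimal sketch for producing such prefixes. It only assumes that sentences in a CoNLL-U file are separated by blank lines; the function names and the output file name in the usage note are hypothetical and not part of Spark NLP.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of lines (comment lines included)."""
    block: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current sentence.
            yield block
            block = []
    if block:
        yield block


def first_n_sentences(text: str, n: int) -> str:
    """Return CoNLL-U text that holds only the first n sentences."""
    kept = []
    for index, block in enumerate(read_sentences(text.splitlines())):
        if index >= n:
            break
        kept.append("\n".join(block))
    # Every sentence, including the last one, ends with a blank line.
    return "\n\n".join(kept) + "\n\n" if kept else ""
```

For example, read "pt_bosque-ud-train.conllu" with `encoding="utf-8"`, write `first_n_sentences(text, 50)` to a hypothetical "bosque_first_50.conllu", and train on that file. Halving or doubling `n` then narrows down which sentence first triggers the exception.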

maziyarpanahi commented 4 months ago

> Can you try with more data? For some reason, I can run it with 3 examples from the Bosque dataset, but when I add more examples it crashes. Also, could it be related to the data being written in Portuguese?

That's interesting! This might be a bug. There is probably a character or a token it doesn't like. In my opinion it shouldn't crash; it should just skip that row or sentence.
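To look for the row it doesn't like, it may help to scan the file for token rows whose HEAD column (the 7th of the 10 tab-separated CoNLL-U columns) is not an integer. In CoNLL-U, multiword-token rows (IDs such as "1-2", which Portuguese treebanks use for contractions) carry "_" in HEAD, so they are a plausible source of the "For input string: "_"" message. That this is the actual cause is an assumption; nothing in this thread confirms it. The function name below is hypothetical and the sketch does not touch Spark NLP.

```python
import re
from typing import Iterable, List, Tuple

INTEGER = re.compile(r"[0-9]+")


def find_non_integer_heads(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, ID, HEAD) for token rows whose HEAD is not an integer."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            # Malformed row: report it instead of guessing which column is HEAD.
            problems.append((number, columns[0], "<wrong column count>"))
            continue
        token_id, head = columns[0], columns[6]
        if INTEGER.fullmatch(head) is None:
            problems.append((number, token_id, head))
    return problems
```

Running it over `open("pt_bosque-ud-train.conllu", encoding="utf-8")` lists the candidate rows. If they all turn out to be ID ranges with HEAD "_", dropping those rows from a copy of the file is one possible workaround to try.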

Will assign this for further inspection.

Arierref46 commented 3 months ago

This seems like great news! How can I install this fix?

danilojsl commented 3 months ago

> This seems like great news! How can I install this fix?

@Arierref46 you just need to update to the latest version: spark-nlp==5.3.3
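In a Python environment the update is typically done with `pip install --upgrade spark-nlp==5.3.3`. The snippet below is a small sketch for confirming afterwards that the installed package is at least that version. The helper names are hypothetical, and it reads the version from the installed package metadata rather than from a running Spark session.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '5.3.3' (or '5.3.3rc1') into (5, 3, 3), ignoring any suffix."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(installed: str, required: str = "5.3.3") -> bool:
    """True if the installed version is the required version or newer."""
    return version_tuple(installed) >= version_tuple(required)


def installed_spark_nlp_version() -> Optional[str]:
    """Return the installed spark-nlp version, or None if it is not installed."""
    try:
        return metadata.version("spark-nlp")
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_spark_nlp_version()` and passing the result to `is_at_least` (when it is not `None`) shows whether the environment already has the release mentioned above.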