bnosac / RDRPOSTagger

R package for Ripple Down Rules-based Part-Of-Speech Tagging (RDRPOS) for more than 45 languages.

String index out of range error with leading symbols #2

Closed fmerhout closed 5 years ago

fmerhout commented 6 years ago

I came across an error when passing the tagger sentences that have a leading symbol like - or ?.

Here is an example:

rdr_pos(rdr_model(language = "English", annotation = "UniversalPOS"), "- what is wrong?")

Returns the following error:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.StringIndexOutOfBoundsException: String index out of range: 0

It seems like this is an rJava error but I thought I'd post it here first.

jwijffels commented 6 years ago

Probably related. Can you let me know what rdr_add_space_around_punctuations("- what is wrong?") gives?

fmerhout commented 6 years ago

Here is the result of rdr_add_space_around_punctuations("- what is wrong?")

" - what is wrong ? "

Interestingly, the same does not happen with rdr_add_space_around_punctuations("+ what is wrong?")

"+ what is wrong ? "

jwijffels commented 6 years ago

Yes, that is the problem: rdr_add_space_around_punctuations does not tokenise correctly. It is the same problem as the other issue just reported. You need to make sure the first character is not a space, tokenise correctly (every token is separated by a space) and set add_space_around_punctuations=FALSE.
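
A minimal sketch of that advice, using only base R for the cleaning step. The helper name space_tokenise is hypothetical and not part of RDRPOSTagger, and the punctuation splitting is deliberately crude (it also splits apostrophes and hyphenated words), so treat it as an illustration rather than a real tokeniser:

```r
# Hypothetical helper (not part of RDRPOSTagger): pad punctuation with spaces,
# collapse repeated whitespace, and trim both ends so the text never starts with a space.
space_tokenise <- function(x) {
  x <- gsub("([[:punct:]])", " \\1 ", x)  # put a space on both sides of each punctuation mark
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace into a single space
  trimws(x)                               # drop leading and trailing spaces
}

txt <- space_tokenise("- what is wrong?")
txt
# [1] "- what is wrong ?"

# Tag the pre-tokenised text, skipping the package's own punctuation spacing.
# Guarded so the sketch still runs where RDRPOSTagger is not installed.
if (requireNamespace("RDRPOSTagger", quietly = TRUE)) {
  library(RDRPOSTagger)
  tagger <- rdr_model(language = "English", annotation = "UniversalPOS")
  rdr_pos(tagger, x = txt, add_space_around_punctuations = FALSE)
}
```

The point is that the string handed to rdr_pos has no leading space and exactly one space between tokens, which is the condition described above.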

shahronak47 commented 6 years ago

It is because of some punctuation in the text. Using removePunctuation(text) from the tm package works for me.

jwijffels commented 5 years ago

Closing as solution was provided. It's up to the user to do tokenisation with this R package. If you need tokenisation, use the udpipe R package.