PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

"NA TOLL" error #1

Closed ChristophLeonhardt closed 3 years ago

ChristophLeonhardt commented 6 years ago

The token "NA" (both letters capitalized) is passed on as an empty string and is not encoded correctly later on. This might be a cwbtools issue as well.

ablaette commented 3 years ago

We realized that this issue occurs whenever the token "NA" happens to be in the token stream: fread() treats the string "NA" as a missing value by default. The solution is to set the argument na.strings of fread() to NULL when reading in annotated corpus data. As this is easy to forget, I implemented a corenlp_parse_conll() function (javamultithreading branch).
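For illustration, here is a minimal sketch of the behavior with data.table::fread(). The two-line, tab-separated input and the three-column layout are made up for this example and are not the actual bignlp output format; the call only uses documented fread() arguments (text, sep, header, quote, na.strings) and does not show how corenlp_parse_conll() is implemented.

```r
library(data.table)

# Hypothetical annotated token stream (id, token, pos), tab-separated.
# The token in the first row is the literal string "NA".
conll_lines <- c("1\tNA\tNN", "2\tTOLL\tNN")

# Default settings: na.strings defaults to "NA", so the token is
# turned into a missing value.
default_read <- fread(
  text = conll_lines, sep = "\t", header = FALSE, quote = ""
)

# With na.strings = NULL no string is interpreted as missing,
# so the token survives as ordinary character data.
fixed_read <- fread(
  text = conll_lines, sep = "\t", header = FALSE, quote = "",
  na.strings = NULL
)

print(is.na(default_read$V2))
print(fixed_read$V2)
```

Comparing the V2 column of the two tables should show whether the token was lost on import, before the data is passed on to later encoding steps.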