PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Stanford CoreNLP 4.0.0 | German NER model leads to broken json #13

Closed ChristophLeonhardt closed 3 years ago

ChristophLeonhardt commented 4 years ago

The following issue does not apply to the current default version of bignlp. The Stanford CoreNLP version downloaded there (3.9.2) still works out of the box.

However, after updating to Stanford CoreNLP 4.0.0 and using the German NER model (german.distsim.crf.ser.gz), corenlp_annotate() in the current workflow returns NDJSON files that corenlp_parse_ndjson() cannot parse. The culprit is the broken formatting of the nerConfidences attribute in the output.

...
"nerConfidences": { "PERSON": 0,9999 }
...

Because the comma is used as the decimal separator, the parser reads it as the end of the value and then expects a string (the next key). For now, I applied a quick fix that replaces the comma before parsing the NDJSON.

x <- readLines(ndjson_files)
x2 <- gsub("\\s(\\d*),(\\d+)", " \\1\\.\\2", x)
writeLines(x2, ndjson_files)

I will leave this issue in case the default version of bignlp is updated in the future and the error resurfaces.

ablaette commented 4 years ago

This is nasty. I tried to find out whether there is some general setting or option in R that can be used to change the decimal separator. But I do not think so; see this attempt, which does not yield the result we would hope for.

library(jsonlite)
x <- '{"nerConfidences": { "PERSON": 0,9999 }}'
getOption("OutDec") # is "." by default
options(OutDec = ",")
jsonlite::parse_json(x) # still fails: OutDec only affects output conversion, not parsing

For read.table(), you could write something like `read.table("/PATH/TO/FILE", dec = ",")`, but this does not work either.

There must be some option in Java, because the behaviour of the output changes depending on the language used. So I tried to consult the Java API documentation of Stanford CoreNLP 4.0.0, but it is not publicly accessible. I also tried to look up how the output for "nerConfidences" is generated in the JSONOutputter class (code at GitHub), but my knowledge of Java is too limited to see how the output behaviour could be changed.

One possibility might be to file an issue with Stanford CoreNLP on GitHub.

For the time being, your hack is the best we have.

ChristophLeonhardt commented 4 years ago

Thank you for the swift response. I fiddled around with jsonlite myself, unfortunately to no avail. I agree that there should be some way to control the output on the Java side. I will try to come up with a simple example to file an issue with Stanford CoreNLP on GitHub.

The hack I proposed earlier is really ugly. Worse, however, it is also faulty because it does not account for parallelization, i.e. it won't work if you have more than one NDJSON file. I changed it as follows to handle this. It's still rather ugly.

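# Fix each NDJSON file separately, so this also works with several files
lapply(ndjson_files, 
       function(ndjson_file) {
         x <- readLines(ndjson_file)
         # Replace the decimal comma in numbers preceded by whitespace, e.g. " 0,9999"
         x2 <- gsub("\\s(\\d*),(\\d+)", " \\1\\.\\2", x)
         writeLines(x2, ndjson_file)
       }
)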
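
A narrower variant of the same idea, as a minimal sketch only: the pattern above rewrites any digits-comma-digits sequence that follows whitespace, so it could in principle also change numbers such as "1,5" inside the token text. The helper fix_ner_confidences() below is hypothetical (it is not part of bignlp) and only rewrites commas inside the nerConfidences object, assuming that object is flat, as in the output quoted at the top of this issue.

```r
# Hypothetical helper, not part of bignlp: repair decimal commas only inside
# "nerConfidences" objects and leave the rest of each line untouched.
# Assumption: the object is flat, i.e. it contains no nested braces.
fix_ner_confidences <- function(lines) {
  # Locate every '"nerConfidences": { ... }' span on each line
  pattern <- '"nerConfidences"\\s*:\\s*\\{[^{}]*\\}'
  hits <- gregexpr(pattern, lines, perl = TRUE)
  # Within the matched spans only, turn digit-comma-digit into digit-point-digit
  regmatches(lines, hits) <- lapply(
    regmatches(lines, hits),
    function(objs) gsub("(\\d),(\\d)", "\\1.\\2", objs, perl = TRUE)
  )
  lines
}
```

It could replace the gsub() call in the loop above, e.g. writeLines(fix_ner_confidences(readLines(ndjson_file)), ndjson_file).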

ablaette commented 4 years ago

Thanks for finding this related issue: https://github.com/stanfordnlp/CoreNLP/issues/1056

So it is very likely that updating to CoreNLP 4.1.0 will solve the problem. If it works, the corenlp_install() function needs to be updated. At present, downloading an outdated version of CoreNLP is hardcoded twice into the function: https://github.com/PolMine/bignlp/blob/18cefacc6c50a5d939ab43bfa0d6fcbf558834ae/R/corenlp.R#L127 https://github.com/PolMine/bignlp/blob/18cefacc6c50a5d939ab43bfa0d6fcbf558834ae/R/corenlp.R#L135

The Download button on the CoreNLP website now offers a link that will seemingly always get you the latest CoreNLP version: http://nlp.stanford.edu/software/stanford-corenlp-latest.zip

I guess we should use this link, and include functionality to inform about the CoreNLP version used.
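
A rough sketch of what reporting the installed version could look like. The function name corenlp_version_from_dir() is hypothetical and not part of bignlp, and the sketch assumes that the unzipped CoreNLP directory contains a jar named like stanford-corenlp-4.2.0.jar, which I have not checked for every release.

```r
# Hypothetical helper, not part of bignlp: infer the CoreNLP version from the
# name of the main jar file found below an installation directory.
# Assumption: the jar is named "stanford-corenlp-<version>.jar".
corenlp_version_from_dir <- function(corenlp_dir) {
  jar_regex <- "^stanford-corenlp-(\\d+(\\.\\d+)+)\\.jar$"
  # list.files() matches the pattern against file names, not full paths
  jars <- list.files(corenlp_dir, pattern = jar_regex, recursive = TRUE)
  if (length(jars) == 0L) return(NA_character_)
  versions <- sub(jar_regex, "\\1", basename(jars))
  # numeric_version compares component-wise, so 4.10.0 ranks above 4.2.0
  as.character(max(numeric_version(versions)))
}
```

With the "latest" download link the version is not known in advance, so reading it back from the unpacked files like this would be one way to tell the user what was actually installed.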

ChristophLeonhardt commented 4 years ago

I wanted to add that the model versions are hardcoded as well.

https://github.com/PolMine/bignlp/blob/8d5fdcf47a0e040d69a59943b1c15c25455e84df/R/corenlp.R#L377

Unfortunately, the links to the model files are not as conveniently generic but version-specific: http://nlp.stanford.edu/software/stanford-corenlp-4.1.0-models-german.jar
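
One way to avoid hardcoding the full link would be to build it from a version string. This is only a sketch: corenlp_model_url() is a hypothetical name, not a bignlp function, and it assumes that other versions and languages follow the same URL pattern as the German 4.1.0 link above.

```r
# Hypothetical helper, not part of bignlp: build the download URL of a
# language model jar from a CoreNLP version string.
# Assumption: all releases follow the pattern of the 4.1.0 German link.
corenlp_model_url <- function(version, lang = "german") {
  sprintf(
    "http://nlp.stanford.edu/software/stanford-corenlp-%s-models-%s.jar",
    version, lang
  )
}

corenlp_model_url("4.1.0")
```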

ablaette commented 3 years ago

Adding to our earlier conversation, I just updated the CoreNLP version to 4.2.0 wherever necessary, including the German properties file.

ablaette commented 3 years ago

The latest version of bignlp (javamultithreading branch) relies on CoreNLP v4.2.0 throughout. The issue has been fixed with the latest CoreNLP version and I have not seen it occur again.