Enable CoNLL output. - Githubissues

brendano / stanford_corenlp_pywrapper


Enable CoNLL output. #31

Open ayrtonmassey opened 9 years ago

ayrtonmassey commented 9 years ago

This patch adds the CoNLL output of Stanford CoreNLP to the JSON annotation.

The data is returned in two forms:

In its raw form as conll_raw, in the same format as given when CoreNLP is run from the command line using the flag -outputFormat conll
Per-sentence as deps_conll, which adds CoNLL dependencies to each sentence.

To enable the CoNLL output, pass "outputFormat": "conll" in the configdict when creating a new CoreNLP instance.
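As a rough sketch, the configuration might look like the following. Only the `"outputFormat": "conll"` entry comes from the description above; the annotator list and the `build_configdict` and `make_pipeline` helpers are illustrative assumptions, and other constructor arguments may be needed depending on the wrapper version.

```python
def build_configdict(annotators="tokenize, ssplit, pos, lemma, ner, parse"):
    """Return a configdict that requests CoNLL output.

    The default annotator list is an assumption, not something the patch requires.
    """
    return {
        "annotators": annotators,
        "outputFormat": "conll",  # the entry this patch looks for
    }


def make_pipeline(corenlp_jars):
    """Hypothetical helper: create a wrapper instance with CoNLL output enabled."""
    # Imported lazily so the sketch can be run without the wrapper installed.
    from stanford_corenlp_pywrapper import CoreNLP
    return CoreNLP(configdict=build_configdict(), corenlp_jars=corenlp_jars)
```

With the patch applied, the parsed document should then carry `conll_raw`, and each sentence `deps_conll`, as described above.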

ayrtonmassey commented 9 years ago

There are a couple of issues with the code I've written. Firstly, both of the new functions can throw exceptions: an IOException from writing to an OutputStream, and a NumberFormatException from the use of Integer.parseInt().

I added catch blocks for them but I wasn't sure how to respond. In the case of an IOException, the CoNLL annotation will not occur but the rest of the annotation will still be returned. A NumberFormatException, however, will leave the document partially annotated: sentences that have already been processed will have a CoNLL annotation and the others will not.

I doubt either of these will occur since the output is taken directly from CoreNLP, but it's possible.

I'm also not sure what happens if a blank document is given - it just occurred to me to test that now.
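One way to sidestep the partial-annotation case would be to parse everything first and only attach the result if every row is valid. The sketch below shows that idea in Python on the raw tab-separated text rather than in the patch's Java code; the function names are hypothetical, and `head_column=5` assumes the head index is the sixth field, which should be checked against real CoreNLP output. A blank document simply yields an empty list.

```python
def split_conll_sentences(conll_raw):
    """Split raw CoNLL text into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in conll_raw.splitlines():
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def parse_heads(conll_raw, head_column=5):
    """Return per-sentence lists of integer head indices, or None on any bad row.

    head_column=5 is an assumption about the column layout.
    """
    parsed = []
    for rows in split_conll_sentences(conll_raw):
        try:
            parsed.append([int(row[head_column]) for row in rows])
        except (ValueError, IndexError):
            return None  # all-or-nothing: no sentence ends up partially annotated
    return parsed
```

Returning `None` for the whole document mirrors the IOException behaviour described above: the CoNLL annotation is skipped entirely while the rest of the annotation can still be returned.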

This is my first pull request, so I apologise if it's a bit messed up!

brendano commented 9 years ago

thanks! one question i have is, what's the purpose of having conll output? if it's to be compatible with other systems that want to input or output conll format, why is the version here slightly different ... using json objects instead of the tab-separated format in conll? or, why use this wrapper code at all instead of using corenlp directly? what exactly is the use case?

(Replying by email to Ayrton Massey's comment above, sent Thu, Aug 20, 2015 at 11:36 AM.)

ayrtonmassey commented 9 years ago

I'm trying to use SEMAFOR to perform Semantic Frame Analysis, which accepts CoNLL data as input. Since I'm already using the wrapper for NER/coref it'd be nice to get the CoNLL output as well, rather than running a separate program. This means I don't have to:

Run two instances of Stanford CoreNLP - one with the wrapper for NER/coref, the other directly to obtain CoNLL output.
Try to integrate a separate system e.g. MaltParser.

If the wrapper is already doing the annotation, I may as well have it produce the CoNLL output too - especially as the wrapper is already integrated with my software.

I did include the raw tab-separated CoNLL data under "conll_raw" since I wasn't sure which was preferable - for some reason Stanford uses their own CoNLL format instead of CoNLL-X or CoNLL-U. For me, including the CoNLL data per-sentence as JSON objects allows me to reconstruct the data in CoNLL-X format, although I assume people looking to use this feature would want the raw data, so I included both.
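To illustrate the reconstruction step, here is a minimal sketch that renders one sentence as ten-column CoNLL-X text. The dict keys (`index`, `form`, `lemma`, `pos`, `head`, `deprel`) and the `to_conllx` name are hypothetical, since the exact shape of `deps_conll` isn't spelled out here; columns with no source data are filled with `_`.

```python
def to_conllx(sentence):
    """Render one sentence (a list of token dicts) as CoNLL-X text.

    The dict keys used here are hypothetical; adapt them to whatever
    deps_conll actually contains.
    """
    lines = []
    for tok in sentence:
        columns = [
            str(tok["index"]),      # ID
            tok["form"],            # FORM
            tok.get("lemma", "_"),  # LEMMA
            tok["pos"],             # CPOSTAG (fine-grained tag reused here)
            tok["pos"],             # POSTAG
            "_",                    # FEATS
            str(tok["head"]),       # HEAD
            tok["deprel"],          # DEPREL
            "_",                    # PHEAD
            "_",                    # PDEPREL
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"
```

Joining the per-sentence strings with a blank line between them gives a file in the sentence-per-block layout that CoNLL-X consumers generally expect.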