brendano / stanford_corenlp_pywrapper


entities vs. entitymentions in API #22

Open AbeHandler opened 9 years ago

AbeHandler commented 9 years ago

Hi @brendano -- I'm trying to find cases where named entities co-refer within a document. For instance: "The Orleans Parish School Board is set to consider the proposal from the teacher's union. OPSB rejected a similar proposal at last month's meeting."

This seems fairly cumbersome w/ the current API. Each ['sentence'] has an ['entitymentions'] list whose ['tokspan'] attribute counts from the start of the document, not the sentence. So if sentence 1 has 10 tokens, then a named entity in sentence 2 might have a tokspan of 11-13.

 "entitymentions": [{"charspan": [52, 61], "sentence": 1, "normalized": "THIS P1W OFFSET P1W",     "type": "DATE", "tokspan": [10, 12], "timex_xml": "<TIMEX3 ...</TIMEX3>"}, {"charspan": [129, 139], "type": "MISC", "tokspan": [24, 25], "sentence": 1}, {"charspan": [226, 238], "type": "PERSON", "tokspan": [40, 42], "sentence": 1}],

So far so good, but the ['entities'] counter goes sentence by sentence, giving a tokspan for each mention w/in a sentence.

{"mentions": {"head": 2, "animacy": "ANIMATE", "sentence": 11, "gender": "UNKNOWN", "mentionid": 84, "mentiontype": "PROPER", "number": "SINGULAR", "tokspan_in_sentence": [2, 3]}

This is workaroundable... but I wonder if the wrapper might be improved by changing the way tokspan is calculated for a given ['sentence'] -- or, alternatively, by adding a ['tokspan_in_sentence'] to each mention in a sentence's 'entitymentions'. In my opinion, it makes sense for a ['sentence'] object to give tokspans that are relative to that sentence.
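In the meantime, lining up a coref mention (sentence-relative) with an entity mention (doc-relative) takes exactly this offset arithmetic. A hedged sketch, with field names taken from the output shown in this issue and everything else (the function name, 0-based 'sentence' indices) assumed:

```python
def find_entity_for_coref_mention(doc, mention):
    """Return the entitymention covering a coref mention, or None.
    Assumes doc-relative 'tokspan' on entity mentions and
    sentence-relative 'tokspan_in_sentence' on coref mentions."""
    sents = doc["sentences"]
    si = mention["sentence"]
    # Doc-relative index at which the mention's sentence starts.
    offset = sum(len(s["tokens"]) for s in sents[:si])
    lo = offset + mention["tokspan_in_sentence"][0]
    hi = offset + mention["tokspan_in_sentence"][1]
    for em in sents[si].get("entitymentions", []):
        if em["tokspan"][0] <= lo and hi <= em["tokspan"][1]:
            return em
    return None
```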

If that change to the API will break everything that uses this wrapper, then maybe it's not worth it. But it does seem sort of confusing to fresh eyes.

See what I am getting at? Happy to work around or fork if you don't feel like mucking with it.

brendano commented 9 years ago

yeah, it would be great to have sentence-relative token positions, perhaps alongside the doc-relative ones. the deep internals of corenlp seem to use doc-relative ones as far as i could tell, but maybe i didn't look hard enough. take a look if you want. also check their xml output code (which my code is based on) to see if they have it somewhere.


brendano commented 9 years ago

or to put it another way: yes, the inconsistency is really lame. if you can wrestle something with more consistency out of corenlp, go for it! this software is just a little layer on top of it and i often have a hard time figuring out their stuff

AbeHandler commented 9 years ago

I often have a hard time figuring out their stuff too. This layer is a good idea -- I've wasted a bunch of time knitting together Stanford components for one-off projects in Java. I will poke around and see if there is a way to give both per-sentence and per-document offsets in the json.

AbeHandler commented 9 years ago

Hrm. So I guess the question is whether this change should go in the Python portion of stanford_corenlp_pywrapper or in the Java portion.

I have not actually run the Java through a debugger, but based on reading the code it seems like the token numbers are coming out of the CoreNLP pipeline.

https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/javasrc/corenlp/JsonPipeline.java#L302

If that is in fact how it is working, then making the change in the Python seems way, way easier, as it does not require digging into the source for the CoreNLP tools, which would be a huge pain. So maybe some kind of 'post processing' method hereabouts that cleans up the output from CoreNLP? https://github.com/brendano/stanford_corenlp_pywrapper/blob/master/stanford_corenlp_pywrapper/sockwrap.py#L200
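Such a post-processing step could live entirely on the Python side as a small hook around the parse call. The sketch below is purely hypothetical (the wrapper has no such function); it only assumes `proc` exposes a `parse_doc(text) -> dict` method, which is how the wrapper is used:

```python
def parse_with_postproc(proc, text, *cleanups):
    """Run the wrapper's parse, then apply each cleanup function
    to the resulting JSON-style dict, in order."""
    doc = proc.parse_doc(text)
    for fn in cleanups:
        doc = fn(doc)
    return doc
```

Any span-fixing function could then be passed in as a cleanup without touching the Java side.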

brendano commented 9 years ago

i think you want to look at addEntityMentions() in the java code (i found it by searching for "entitymentions", the key name in the json output), and see also the corenlp webpage, which lists all the annotations they have.

postproc is dangerous because it's a maintainability burden: what if their code changes, or the assumptions behind the postproc turn out wrong? on the other hand, if you have to do it anyway, you might as well write it into this level.


AbeHandler commented 9 years ago

Yah. I see your point on post processing. I will dig through the corenlp stuff, which I should know in the first place.