CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/
Other
470 stars 144 forks source link

Check for duplicate constituents before adding #684

Closed mayhewsw closed 5 years ago

mayhewsw commented 6 years ago

This addresses and closes #664

danyaljj commented 6 years ago

Can you check why the tests are failing? (or if they are due to the changes here?) http://morgoth.cs.illinois.edu:5800/viewLog.html?buildId=744&buildTypeId=CogcompNlp_Build&tab=buildLog&filter=err

Likely they are, since the master is passing the tests.

mayhewsw commented 6 years ago

There's a subtle bug that happens in the ProtobufSerializerTest. When a TA is read back from protobuf, it refuses to add a constituent because it is duplicate, but this constituent participates in some relation. The null pointer comes because there's a relation that's missing one of the argument constituents.

The original TA comes from the DummyTextAnnotationGenerator. Presumably what happens is that two constituents are created in the SRL_VERB view, identical except for the relations they participate in. These can be added separately to the view, because relations are counted in equality check.

Then, when it is reloaded, the (identical) constituents are created, added (fail), then relations are given to them.

I think the issue now is with the DummyTextAnnotationGenerator SRL_VERB view generation.

mayhewsw commented 6 years ago

It comes down to this: should constituent equality include relations (in or out). That is, imagine I have constituents A, B, and C. Consider the link A-B. Now I create a new constituent B' that is identical in every way to B, except it is not connected to A. Should B.equals(B') = true?

mssammon commented 6 years ago

This is partly an issue of how SRL frames should be instantiated: if a single argument span occurs in multiple frames, should the annotator create multiple spans or a single span shared across frames? My feeling is that it should be a single span across frames. In this case, equality check should not include incoming/outgoing relations. @danyaljj any objections? @mayhewsw Assuming @danyaljj is on board, please make the changes to DummyTextAnnotationGenerator. We also need to make sure SRL annotators adhere to this constraint; perhaps using pipeline with an input that reliably generates multiple verb frames sharing arguments.

danyaljj commented 6 years ago

Good by me.

mayhewsw commented 5 years ago

All tests pass. Changes are substantial, so please do take a look. In general, duplicate constituents are discouraged, but it is possible to force a view to accept duplicate constituents. For example, in a parse tree, one has the label level (with label . (period)) and the leaf, which is a child, has the same span and the same label. These are technically identical, but need to be represented separately.

One big change: equality of constituents does not take relations into account at all. I think this is the right way.

danyaljj commented 5 years ago

Two small comments; otherwise it looks good to me.