CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

On behavior of Chunker on words with "-" #330

Closed danyaljj closed 7 years ago

danyaljj commented 7 years ago

Consider this sentence: "Notre Dame students had a showdown in 1924 with which anti-catholic group?"

Using the demo I get the correct chunking (screenshot from 2017-01-04 not reproduced here).

While using the latest version of the pipeline I get:

[info] Notre Dame students: NP
[info] had: VP
[info] a showdown: NP
[info] in: PP
[info] 1924: NP
[info] with: PP
[info] which anti: NP
[info] catholic group: NP

which separates "anti" and "catholic". Any idea whether this is the result of a slight change in model weights, or whether there is a systematic issue behind it?

danyaljj commented 7 years ago

I tried a few more examples.

[info] text: My estate goes to my husband, son, daughter-in-law, and nephew.
[info] My estate: NP
[info] goes: VP
[info] to: PP
[info] my husband: NP
[info] son: NP
[info] daughter: NP
[info] in: PP
[info] -law: NP
[info] nephew: NP
--------
[info] text: Notre Dame students had a showdown in 1924 with which anti-catholic group?
[info] Notre Dame students: NP
[info] had: VP
[info] a showdown: NP
[info] in: PP
[info] 1924: NP
[info] with: PP
[info] which anti: NP
[info] catholic group: NP
--------
[info] text: He is a well-respected man.
[info] He: NP
[info] is: VP
[info] a well: NP
[info] respected man: NP
--------
[info] text: We knew there would be a stop-off in Singapore for refuelling.
[info] We: NP
[info] knew: VP
[info] there: NP
[info] would be: VP
[info] a stop: NP
[info] off: ADVP
[info] in: PP
[info] Singapore: NP
[info] for: PP
[info] refuelling: NP
--------
[info] text: The house was unoccupied at the time of the break-in.
[info] The house: NP
[info] was: VP
[info] unoccupied: ADJP
[info] at: PP
[info] the time: NP
[info] of: PP
[info] the break: NP
[info] in: PP
--------
[info] text: There was a build-up of traffic on the ring road.
[info] There: NP
[info] was: VP
[info] a build: NP
[info] up: PP
[info] of: PP
[info] traffic: NP
[info] on: PP
[info] the ring road: NP
--------
[info] text: The design is state-of-the-art.
[info] The design: NP
[info] is: VP
[info] state: NP
[info] of: PP
[info] the-art: NP

This issue occurs in all of the examples, so I think it's a systematic problem.

danyaljj commented 7 years ago

Looking into the raw chunker labels,

( PRP$ My) -> B-NP
( NN estate) -> I-NP
( VBZ goes) -> B-VP
( TO to) -> B-PP
( PRP$ my) -> B-NP
( NN husband) -> I-NP
( , ,) -> O
( NN son) -> B-NP
( , ,) -> O
( NN daughter) -> B-NP
( : -) -> O
( IN in) -> B-PP
( : -) -> B-NP
( NN law) -> I-NP
( , ,) -> O
( CC and) -> O
( NN nephew) -> B-NP
( . .) -> O

Result --> [NP My estate ] [VP goes ] [PP to ] [NP my husband ] [NP son ] [NP daughter ] [PP in ] [NP - law ] [NP nephew ] 
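For readers following along: the raw labels above use the standard BIO scheme, and the chunk spans are recovered by merging them. A minimal merge sketch (illustrative only, not the actual cogcomp-nlp chunker code) reproduces the `[NP - law ]` artifact directly from the tags:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal BIO-to-chunk merge sketch (illustrative; not the actual
// cogcomp-nlp chunker code). B-X opens a chunk of type X, I-X extends an
// open chunk of the same type, and O closes whatever chunk is open.
public class BioMerge {

    public static List<String> merge(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        String type = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            boolean continues = current != null
                    && tag.startsWith("I-")
                    && tag.substring(2).equals(type);
            if (continues) {
                current.append(" ").append(tokens[i]);
            } else {
                // Close the open chunk, if any.
                if (current != null) {
                    chunks.add("[" + type + " " + current + " ]");
                    current = null;
                    type = null;
                }
                // B-X (or a stray I-X with no open chunk) starts a new chunk.
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    type = tag.substring(2);
                    current = new StringBuilder(tokens[i]);
                }
                // O tokens fall outside any chunk.
            }
        }
        if (current != null)
            chunks.add("[" + type + " " + current + " ]");
        return chunks;
    }

    public static void main(String[] args) {
        // The "daughter-in-law" fragment from the trace above.
        String[] tokens = {"daughter", "-", "in", "-", "law"};
        String[] tags   = {"B-NP", "O", "B-PP", "B-NP", "I-NP"};
        System.out.println(String.join(" ", merge(tokens, tags)));
        // -> [NP daughter ] [PP in ] [NP - law ]
    }
}
```

This shows the merge step is behaving correctly given its input; the problem is upstream, in the tags assigned to the dash-split tokens.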
danyaljj commented 7 years ago

I wonder how the demo differs from the current Chunker annotator...

mssammon commented 7 years ago

This could be due to a change in the Tokenizer behavior; the Chunker model needs retraining. @qiangning please take note: this should happen automatically when you retrain the chunker on the single-precision branch, but it is a possible source of differences in output.

qiangning commented 7 years ago

This means that when I trained the current chunker model, our tokenizer had some issues with "-", so now I need to train it again. Am I correct? @mssammon

mssammon commented 7 years ago

@qiangning, @danyaljj, @cowchipkid: The relevant change was here: https://github.com/IllinoisCogComp/illinois-cogcomp-nlp/pull/300. I am not 100% sure how this interleaved with retraining the chunker and POS models. From what Daniel reported, it looks like the tokenizer behavior changed; according to the comments on the PR referenced above, however, the default behavior of the tokenizer did not change. @danyaljj, can you confirm whether or not tokenizer behavior differs between the Curator and locally running the Chunker with defaults? @cowchipkid, can you verify that default tokenization behavior is the same before and after the PR where you changed the tokenization behavior?

danyaljj commented 7 years ago

I can confirm that setting splitOnDash = false fixes the issue, at least on the examples I have:

[My, estate, goes, to, my, husband, ,, son, ,, daughter-in-law, ,, and, nephew, .]
[( PRP$ My), ( NN estate), ( VBZ goes), ( TO to), ( PRP$ my), ( NN husband), ( , ,), ( NN son), ( , ,), ( NN daughter-in-law), ( , ,), ( CC and), ( NN nephew), ( . .)]
( PRP$ My) -> B-NP
( NN estate) -> I-NP
( VBZ goes) -> B-VP
( TO to) -> B-PP
( PRP$ my) -> B-NP
( NN husband) -> I-NP
( , ,) -> O
( NN son) -> B-NP
( , ,) -> O
( NN daughter-in-law) -> B-NP
( , ,) -> O
( CC and) -> O
( NN nephew) -> B-NP
( . .) -> O
[POS, SHALLOW_PARSE, TOKENS, SENTENCE]
My estate goes to my husband , son , daughter-in-law , and nephew . 
[NP My estate ] [VP goes ] [PP to ] [NP my husband ] [NP son ] [NP daughter-in-law ] [NP nephew ] 

But now my question is: what was the reason for setting splitOnDash = true in #300?

mssammon commented 7 years ago

Then this sounds like the documentation for the tokenizer needs to be clarified, and we should retrain and deploy models for POS and Chunker if they have not already been trained using this setting. Initially, this retraining should only be done if needed for the work on the single-precision branch, since we are planning to release updated models anyway. @qiangning @cgraber @cowchipkid

danyaljj commented 7 years ago

For me, the more important thing is the default behavior of the tokenizer (especially since, for a pipeline user, it's not trivial to get access to this setting). Currently splitOnDash = true, while splitOnDash = false makes more sense to me. Do we have any strong reasons to keep it true?

Also, I think we should expand the unit tests for the tokenizer. Here are a few examples that would partially cover the issue:

    @Test
    public void testAnnotator() {
        ChunkerAnnotator chunkAnnotator = new ChunkerAnnotator(false, new ChunkerConfigurator().getDefaultConfig());
        POSAnnotator posAnnotator = new POSAnnotator(false);
        String[] sentences = {
                "My estate goes to my husband, son, daughter-in-law, and nephew.",
                "Notre Dame students had a showdown in 1924 with which anti-catholic group?",
                "He is a well-respected man.",
                "We knew there would be a stop-off in Singapore for refuelling.",
                "The house was unoccupied at the time of the break-in.",
                "There was a build-up of traffic on the ring road.",
                "The design is state-of-the-art.",
                "The slacker video-gamed his way through life.",
                "Queen Victoria throne-sat for six decades.",
                "I changed my diet and became a no-meater.",
                "No-meater is too confusing without the hyphen."
        };

        // Build with splitOnDash = false so hyphenated words stay single tokens.
        TextAnnotationBuilder builder = new TokenizerTextAnnotationBuilder(new StatefulTokenizer(false));
        for (String sentence : sentences) {
            TextAnnotation ta = builder.createTextAnnotation("", "", sentence);
            try {
                posAnnotator.addView(ta);
                chunkAnnotator.addView(ta);
            } catch (AnnotatorException e) {
                e.printStackTrace();
            }
            System.out.println(ta.getAvailableViews());
            System.out.println(ta.getView(ViewNames.TOKENS));
            System.out.println(ta.getView(ViewNames.SHALLOW_PARSE));
        }
    }
mssammon commented 7 years ago

The main reason to keep it true is to allow NER tagging/coreference linking of hyphenated terms, like "U.S.-led". What is the compelling reason to set the default to false? It could be just that this is the way we used to do it, or that this is the way the WSJ corpus annotates -- @christos-c, is that true? Even if those reasons hold, there is one reason to prefer splitting as the default: intuitively, it seems easier to annotate the finer split with POS, NER, etc. and then combine the pieces when you want to identify properties of compound terms, rather than to do the finer-grained annotation in a post-processing step. If we have some pertinent data, I think this could be an interesting research project.
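As a concrete illustration of the split-then-combine direction, here is a minimal post-processing sketch that re-joins dash-split tokens into compound terms. All names here are hypothetical, not the cogcomp-nlp API, and the word-like test is a simplifying assumption:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing sketch: re-join tokens that were split on a
// dash (e.g. "anti", "-", "catholic" -> "anti-catholic") after the
// finer-grained annotators have run. Names are illustrative, not the
// cogcomp-nlp API.
public class DashRejoin {

    public static List<String> rejoin(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            // A dash is "gluable" when it sits between two word-like tokens.
            boolean gluable = tok.equals("-")
                    && !out.isEmpty()
                    && i + 1 < tokens.size()
                    && isWordLike(out.get(out.size() - 1))
                    && isWordLike(tokens.get(i + 1));
            if (gluable) {
                // Fold the dash and the following token into the previous one.
                out.set(out.size() - 1, out.get(out.size() - 1) + "-" + tokens.get(++i));
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    // Word-like: starts with a letter, then letters or internal dashes
    // (so already-rejoined tokens such as "state-of" still qualify).
    private static boolean isWordLike(String s) {
        return s.matches("\\p{L}[\\p{L}-]*");
    }

    public static void main(String[] args) {
        List<String> toks = Arrays.asList("which", "anti", "-", "catholic", "group", "?");
        System.out.println(rejoin(toks)); // [which, anti-catholic, group, ?]
    }
}
```

A real version would also have to merge the POS/chunk labels of the folded tokens, which is exactly the part that is easy to get wrong when done downstream.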

christos-c commented 7 years ago

Originally, the Penn Treebank didn't split on hyphens, but around the time of the OntoNotes corpus release they systematically started splitting them up: http://cemantix.org/papers/pradhan-hlt-2009-ontonotes-tutorial.pdf (see slide 27). I also think that it's better to split from a syntactic/semantic point of view, and the AMR guidelines seem to agree.

mssammon commented 7 years ago

@cgraber @qiangning The single-precision branch has the split-by-default tokenization behavior. Qiang, you can ignore the failing test for now as long as you can train and evaluate the single-precision chunker models (or you can simply change the failing test). Assuming that there are no problems -- and it looks from POS and NER that there aren't -- then we can merge this branch.