Closed danyaljj closed 7 years ago
I tried a few more examples.
[info] text: My estate goes to my husband, son, daughter-in-law, and nephew.
[info] My estate: NP
[info] goes: VP
[info] to: PP
[info] my husband: NP
[info] son: NP
[info] daughter: NP
[info] in: PP
[info] -law: NP
[info] nephew: NP
--------
[info] text: Notre Dame students had a showdown in 1924 with which anti-catholic group?
[info] Notre Dame students: NP
[info] had: VP
[info] a showdown: NP
[info] in: PP
[info] 1924: NP
[info] with: PP
[info] which anti: NP
[info] catholic group: NP
--------
[info] text: He is a well-respected man.
[info] He: NP
[info] is: VP
[info] a well: NP
[info] respected man: NP
--------
[info] text: We knew there would be a stop-off in Singapore for refuelling.
[info] We: NP
[info] knew: VP
[info] there: NP
[info] would be: VP
[info] a stop: NP
[info] off: ADVP
[info] in: PP
[info] Singapore: NP
[info] for: PP
[info] refuelling: NP
--------
[info] text: The house was unoccupied at the time of the break-in.
[info] The house: NP
[info] was: VP
[info] unoccupied: ADJP
[info] at: PP
[info] the time: NP
[info] of: PP
[info] the break: NP
[info] in: PP
--------
[info] text: There was a build-up of traffic on the ring road.
[info] There: NP
[info] was: VP
[info] a build: NP
[info] up: PP
[info] of: PP
[info] traffic: NP
[info] on: PP
[info] the ring road: NP
--------
[info] text: The design is state-of-the-art.
[info] The design: NP
[info] is: VP
[info] state: NP
[info] of: PP
[info] the-art: NP
This issue happens in all the examples, so I think it's a systematic problem.
Looking into the raw chunker labels,
( PRP$ My) -> B-NP
( NN estate) -> I-NP
( VBZ goes) -> B-VP
( TO to) -> B-PP
( PRP$ my) -> B-NP
( NN husband) -> I-NP
( , ,) -> O
( NN son) -> B-NP
( , ,) -> O
( NN daughter) -> B-NP
( : -) -> O
( IN in) -> B-PP
( : -) -> B-NP
( NN law) -> I-NP
( , ,) -> O
( CC and) -> O
( NN nephew) -> B-NP
( . .) -> O
Result --> [NP My estate ] [VP goes ] [PP to ] [NP my husband ] [NP son ] [NP daughter ] [PP in ] [NP - law ] [NP nephew ]
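For reference, here is a minimal, self-contained sketch (not the project's actual code) of how BIO chunk tags like those above are typically merged into bracketed spans. A B-X tag always opens a new chunk and an O tag closes the current one, so once the tokenizer splits "daughter-in-law" into separate tokens tagged B-NP / B-PP / B-NP, they can only surface as separate chunks:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of standard BIO-tag merging: B-X starts a new chunk, I-X extends
// the current chunk, O closes any open chunk. This is illustrative only,
// not the ChunkerAnnotator's actual implementation.
public class BioMerge {
    public static List<String> merge(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        String label = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = tags[i];
            if (tag.startsWith("B-")) {
                // a B- tag always closes the previous chunk and opens a new one
                if (current != null) chunks.add("[" + label + " " + current + " ]");
                label = tag.substring(2);
                current = new StringBuilder(tokens[i]);
            } else if (tag.startsWith("I-") && current != null) {
                current.append(" ").append(tokens[i]);
            } else { // O tag: close any open chunk
                if (current != null) chunks.add("[" + label + " " + current + " ]");
                current = null;
            }
        }
        if (current != null) chunks.add("[" + label + " " + current + " ]");
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"My", "estate", "goes", "to", "my", "husband", ","};
        String[] tags   = {"B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "O"};
        System.out.println(merge(tokens, tags));
        // prints [[NP My estate ], [VP goes ], [PP to ], [NP my husband ]]
    }
}
```

This makes the failure mode clear: the problem is not in the merging step but in the tags the model assigns to the dash-split tokens.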
I wonder how the demo differs from the current Chunker annotator...
This could be due to a change in the Tokenizer behavior. Chunker model needs retraining. @qiangning Please take note... this should happen automatically when you retrain chunker with single-precision branch, but is a possible source of differences in output.
This means when I trained the current chunker model, our tokenizer had some issues with "-"s, so now I need to train it again. Am I correct? @mssammon
@qiangning, @danyaljj, @cowchipkid: The relevant change was here: https://github.com/IllinoisCogComp/illinois-cogcomp-nlp/pull/300. I am not 100% sure how this interleaved with retraining the chunker and POS. From what Daniel reported, it looks like the tokenizer behavior changed. According to the comments on the PR referenced above, the default behavior of the tokenizer did not change. @danyaljj, can you confirm whether or not tokenizer behavior is different in the Curator vs. locally running the Chunker with defaults? @cowchipkid, can you verify that default tokenization behavior is the same before and after the PR where you changed the tokenization behavior?
I can confirm that setting splitOnDash = false would fix the issue, at least on the examples I have:
[My, estate, goes, to, my, husband, ,, son, ,, daughter-in-law, ,, and, nephew, .]
[( PRP$ My), ( NN estate), ( VBZ goes), ( TO to), ( PRP$ my), ( NN husband), ( , ,), ( NN son), ( , ,), ( NN daughter-in-law), ( , ,), ( CC and), ( NN nephew), ( . .)]
( PRP$ My) -> B-NP
( NN estate) -> I-NP
( VBZ goes) -> B-VP
( TO to) -> B-PP
( PRP$ my) -> B-NP
( NN husband) -> I-NP
( , ,) -> O
( NN son) -> B-NP
( , ,) -> O
( NN daughter-in-law) -> B-NP
( , ,) -> O
( CC and) -> O
( NN nephew) -> B-NP
( . .) -> O
[POS, SHALLOW_PARSE, TOKENS, SENTENCE]
My estate goes to my husband , son , daughter-in-law , and nephew .
[NP My estate ] [VP goes ] [PP to ] [NP my husband ] [NP son ] [NP daughter-in-law ] [NP nephew ]
But now a question for me: what is the reason to have splitOnDash = true, as done in #300?
Then this sounds like the documentation for the tokenizer needs to be clarified, and we should retrain and deploy models for POS and Chunker if they have not already been trained using this setting. Initially, this retraining should only be done if needed for the work on the single-precision branch, since we are planning to release updated models anyway. @qiangning @cgraber @cowchipkid
For me, the more important thing is the default behavior of the tokenizer (especially since, for a pipeline user, it's not trivial to get access to this setting). Currently the default is splitOnDash = true, while splitOnDash = false makes more sense to me. Do we have any strong reasons to keep it true?
Also, I think we have to expand the unit tests for the tokenizer. Here are a few examples that could partially cover the issue here:
@Test
public void testAnnotator() {
    ChunkerAnnotator chunkAnnotator =
            new ChunkerAnnotator(false, new ChunkerConfigurator().getDefaultConfig());
    POSAnnotator posAnnotator = new POSAnnotator(false);
    String[] sentences = {
            "My estate goes to my husband, son, daughter-in-law, and nephew.",
            "Notre Dame students had a showdown in 1924 with which anti-catholic group?",
            "He is a well-respected man.",
            "We knew there would be a stop-off in Singapore for refuelling.",
            "The house was unoccupied at the time of the break-in.",
            "There was a build-up of traffic on the ring road.",
            "The design is state-of-the-art.",
            "The slacker video-gamed his way through life.",
            "Queen Victoria throne-sat for six decades.",
            "I changed my diet and became a no-meater.",
            "No-meater is too confusing without the hyphen."
    };
    // the boolean argument is splitOnDash = false, per the discussion above
    TextAnnotationBuilder builder =
            new TokenizerTextAnnotationBuilder(new StatefulTokenizer(false));
    for (String sentence : sentences) {
        TextAnnotation ta = builder.createTextAnnotation("", "", sentence);
        try {
            posAnnotator.addView(ta);
            chunkAnnotator.addView(ta);
        } catch (AnnotatorException e) {
            e.printStackTrace();
        }
        System.out.println(ta.getAvailableViews());
        System.out.println(ta.getView(ViewNames.TOKENS));
        System.out.println(ta.getView(ViewNames.SHALLOW_PARSE));
        // TODO: replace the printouts with assertions on the expected chunk spans
    }
}
The main reason to keep it true is to allow NER tagging/coreference linking of hyphenated terms, like "U.S.-led". What is the compelling reason to set the default to false? It could be just that this is the way we used to do it; or maybe that this is the way the WSJ corpus annotates -- @christos-c, is that true? Even if those latter reasons are correct, there is one reason to prefer splitting as the default: intuitively, it seems easier to annotate the finer split with POS, NER, etc. and then combine the pieces when you want to identify properties of compound terms, rather than to do these finer-grained annotations in a post-processing step. If we have some pertinent data, I think this could be an interesting research project.
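To illustrate the "combine afterwards" direction: a hypothetical sketch (not existing project code) that re-joins adjacent tokens when the split point is a hyphen with no surrounding whitespace, recovering compound terms like "anti-catholic" or "U.S.-led" from a dash-split tokenization. It assumes character offsets (start, end) are available for each token, as they are in a TextAnnotation; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing sketch: re-join dash-split tokens by
// checking whether adjacent tokens touch in the original text and the
// boundary character is a hyphen. Punctuation that merely abuts a word
// (e.g. a final "?") is left as its own token.
public class HyphenRejoin {
    public static List<String> rejoin(String text, int[][] offsets) {
        List<String> merged = new ArrayList<>();
        int start = offsets[0][0];
        int end = offsets[0][1];
        for (int i = 1; i < offsets.length; i++) {
            boolean adjacent = offsets[i][0] == end;
            // merge only if the tokens touch and the join point is a hyphen
            boolean hyphenAtJoin = adjacent
                    && (text.charAt(end - 1) == '-' || text.charAt(end) == '-');
            if (hyphenAtJoin) {
                end = offsets[i][1];
            } else {
                merged.add(text.substring(start, end));
                start = offsets[i][0];
                end = offsets[i][1];
            }
        }
        merged.add(text.substring(start, end));
        return merged;
    }

    public static void main(String[] args) {
        String text = "which anti-catholic group?";
        // dash-split token offsets: which | anti | - | catholic | group | ?
        int[][] offsets = {{0, 5}, {6, 10}, {10, 11}, {11, 19}, {20, 25}, {25, 26}};
        System.out.println(rejoin(text, offsets));
        // prints [which, anti-catholic, group, ?]
    }
}
```

Whether this recombination step is easier or harder than annotating unsplit tokens directly is exactly the empirical question raised above.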
Originally, the Penn Treebank didn't split on hyphens, but around the time of the OntoNotes corpus release they systematically started splitting them up: http://cemantix.org/papers/pradhan-hlt-2009-ontonotes-tutorial.pdf (see slide 27). I also think that it's better to split from a syntactic/semantic point of view, and the AMR guidelines seem to agree.
@cgraber @qiangning The single-precision branch has the split-by-default tokenization behavior. Qiang, you can ignore the failing test for now as long as you can train and evaluate the single-precision chunker models (or you can simply change the failing test). Assuming that there are no problems -- and it looks from POS and NER that there aren't -- then we can merge this branch.
Consider this sentence: "Notre Dame students had a showdown in 1924 with which anti-catholic group?"
Using the demo I get the following (screenshot omitted), which is correct.
While using the latest version of the pipeline I get the following (screenshot omitted),
which separates "anti" and "catholic". Any ideas whether this is the result of a slight change in the model weights, or whether there is a systematic issue causing this?