cltl / multilingual_factuality


KeyError: t_xxx #4

Closed vanatteveldt closed 8 years ago

vanatteveldt commented 8 years ago

I seem to get this error occasionally when running the multilingual factuality module.

Traceback (most recent call last):
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/rule_based_factuality.py", line 413, in <module>
    main()
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/rule_based_factuality.py", line 406, in main
    run_factuality_module(nafobj)
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/rule_based_factuality.py", line 392, in run_factuality_module
    events_features = extract_features(feature_extractor, target_events)
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/rule_based_factuality.py", line 367, in extract_features
    add_predicate_chain_features(feature_extractor, event, myFeatures)
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/rule_based_factuality.py", line 210, in add_predicate_chain_features
    pred_chain = feature_extractor.get_list_term_ids_to_root(tid)
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/my_feature_extractor.py", line 173, in get_list_term_ids_to_root
    root_for_sentence = this_graph.get_root()
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/my_feature_extractor.py", line 41, in get_root
    self.calculate_root()
  File "/data/wva/newsreader_pipe_nl/modules/multilingual_factuality/feature_extractor/my_feature_extractor.py", line 35, in calculate_root
    list_with_min_freq = [(term_id, len(self.G[term_id])) for term_id, freq in L if freq == min_freq]
KeyError: 't_840'

An example input file that causes the error can be found here: http://i.amcat.nl/keyerror.naf

rubenIzquierdo commented 8 years ago

About the two problems:

1) The problem of empty paths from a certain term to the root: the main cause is the tokeniser, which does not split the sentences properly. The dependency parser (Alpino) then runs on these malformed sentences and produces a lot of nonsensical dependencies that do not correspond to a valid dependency tree. For instance, in the example input all of the following tokens end up in the same sentence (a small check for this is sketched after the token list):

<wf id="w1" length="6" offset="0" para="1" sent="1">PostNL</wf>
<wf id="w2" length="4" offset="7" para="1" sent="1">gaat</wf>
<wf id="w3" length="4" offset="12" para="1" sent="1">naar</wf>
<wf id="w4" length="4" offset="17" para="1" sent="1">3000</wf>
<wf id="w5" length="12" offset="22" para="1" sent="1">afhaalpunten</wf>
<wf id="w6" length="4" offset="36" para="1" sent="1">door</wf>
<wf id="w7" length="5" offset="41" para="1" sent="1">Wilko</wf>
<wf id="w8" length="8" offset="47" para="1" sent="1">Voordouw</wf>
<wf id="w9" length="9" offset="57" para="2" sent="1">AMSTERDAM</wf>
<wf id="w10" length="1" offset="67" para="2" sent="1">-</wf>
<wf id="w11" length="2" offset="71" para="2" sent="1">Op</wf>
<wf id="w12" length="5" offset="74" para="2" sent="1">korte</wf>
<wf id="w13" length="7" offset="80" para="2" sent="1">termijn</wf>
<wf id="w14" length="4" offset="88" para="2" sent="1">moet</wf>
<wf id="w15" length="3" offset="93" para="2" sent="1">het</wf>
<wf id="w16" length="6" offset="97" para="2" sent="1">aantal</wf>
<wf id="w17" length="19" offset="104" para="2" sent="1">PostNL-afhaalpunten</wf>
<wf id="w18" length="4" offset="124" para="2" sent="1">voor</wf>
<wf id="w19" length="9" offset="129" para="2" sent="1">pakketjes</wf>
<wf id="w20" length="3" offset="139" para="2" sent="1">van</wf>
<wf id="w21" length="4" offset="143" para="2" sent="1">ruim</wf>
<wf id="w22" length="4" offset="148" para="2" sent="1">2000</wf>
<wf id="w23" length="4" offset="153" para="2" sent="1">naar</wf>
<wf id="w24" length="4" offset="158" para="2" sent="1">3000</wf>
<wf id="w25" length="1" offset="162" para="2" sent="1">.</wf>

I included some statements to raise an exception in these cases, but the "real" problem is still there.
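
The guard in the module itself is different, but the idea is roughly the following sketch, assuming the per-sentence dependency graph is a plain dict mapping each term id to the list of term ids it points to:

def path_to_root(graph, term_id, root_id):
    """Walk upwards from term_id to root_id and fail loudly instead of
    producing an empty path when the dependencies do not form a tree."""
    path, seen = [term_id], {term_id}
    current = term_id
    while current != root_id:
        parents = [frm for frm, tos in graph.items() if current in tos]
        if not parents:
            raise ValueError('no path from %s to root %s: malformed dependency tree '
                             '(probably caused by the tokenisation problem above)'
                             % (term_id, root_id))
        current = parents[0]
        if current in seen:
            raise ValueError('cycle detected while walking up from %s' % term_id)
        seen.add(current)
        path.append(current)
    return path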

2) About the KeyError with t_841.

This was related to the way the module selects the root node for a given sentence. It was based on two heuristics:

1) Select nodes with the smallest number of dependency relations arriving TO them
2) Select nodes with the biggest number of dependency relations starting FROM them

In some cases there can be a tie between several nodes; for instance, around t_841 in the example we have these dependencies:

<!--whd/body(Hoe,zit)-->
<dep from="t_838" rfunc="whd/body" to="t_839"/>
<!--hd/predc(zit,Hoe)-->
<dep from="t_839" rfunc="hd/predc" to="t_838"/>
<!--hd/su(zit,dat)-->
<dep from="t_839" rfunc="hd/su" to="t_840"/>
<!--- - / - -(Hoe,?)-->
<dep from="t_838" rfunc="-- / --" to="t_841"/>
<!--dp/dp('In,zoektocht)-->

So with these two heuristics there were still two candidates, t_838 and t_839.

I added a third heuristic that resolves this tie by selecting as root the term tagged as a verb, in this case t_839 (ZITTEN).
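
Roughly, the selection now behaves like the sketch below (not the module's actual code). Here deps is the list of (from, to) term id pairs for one sentence, and pos_of is assumed to map a term id to its part-of-speech tag:

from collections import Counter

def select_root(deps, pos_of):
    nodes = {term for pair in deps for term in pair}
    in_deg = Counter(to for _, to in deps)
    out_deg = Counter(frm for frm, _ in deps)

    # 1) nodes with the smallest number of dependency relations arriving TO them
    min_in = min(in_deg.get(n, 0) for n in nodes)
    candidates = [n for n in nodes if in_deg.get(n, 0) == min_in]

    # 2) among those, nodes with the biggest number of relations starting FROM them
    max_out = max(out_deg.get(n, 0) for n in candidates)
    candidates = [n for n in candidates if out_deg.get(n, 0) == max_out]

    # 3) tie-break: prefer a candidate tagged as a verb
    #    (the exact tag check depends on the tagset used)
    if len(candidates) > 1:
        verbs = [n for n in candidates if pos_of.get(n, '').lower().startswith('v')]
        if verbs:
            candidates = verbs

    # if still tied, fall back to a deterministic choice
    return sorted(candidates)[0]

With the four dependencies shown above, steps 1 and 2 leave t_838 and t_839 tied, and step 3 picks t_839 (ZITTEN).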

Ruben Izquierdo Bevia, Vrije Universiteit Amsterdam, ruben.izquierdobevia@vu.nl, http://rubenizquierdobevia.com/


vanatteveldt commented 8 years ago

Wasn't the tokenizer problem solved in https://github.com/cltl/morphosyntactic_parser_nl/pull/7 ? I was fairly confident that I updated the parser before running, but I'll try again.

(btw, you can probably close this if you fixed problem 2, as problem 1 is really issue #3?)