gabrielStanovsky / unified-factuality

Code, data and models for the paper "Integrating Deep Linguistic Features in Factuality Prediction over Unified Datasets" (Stanovsky, Eckle-Kohler, Puzikov, Dagan and Gurevych ACL 2017)
MIT License

Factbank's align method fails on ". . ." #21

Closed rudinger closed 7 years ago

rudinger commented 7 years ago

Fails with this exception:

Traceback (most recent call last):
  File "./readers.py", line 659, in <module>
    os.path.join(inp_, "tokens_tml.txt"))
  File "./readers.py", line 136, in __init__
    self.conll_txt = self.convert(tokens_tml)
  File "./readers.py", line 215, in convert
    dep_feats = self.get_dep_feats(toks, cur_sent)
  File "./readers.py", line 236, in get_dep_feats
    self.align(toks, sent)
  File "./readers.py", line 297, in align
    cur_word)))
Exception: Unknown case: (" He was told . . . the handwriting was on the wall , " the source said Monday ., '.', 'the')

I think what is happening is that, in the previous step of the align method, cur_tok is '.' and cur_word is '. . .', so two things go wrong: (1) cur_tok + str(toks[toks_ind + 1]) returns '..' instead of '. .', so it is not recognized as a substring of '. . .' even though it should be, because the space is missing. (2) This looks like a case where toks[toks_ind : toks_ind + 2].merge() would be the selected action, but it should really be something like toks[toks_ind : toks_ind + 3].merge(), because the word corresponds to three separate tokens.
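The failed substring check in point (1) can be reproduced without spaCy. This is only a minimal sketch: it assumes the token texts are plain strings, and the list `toks` is a stand-in for the spaCy tokens, not the real objects that `align` receives.

```python
# Stand-ins for the values seen inside align() when it fails.
cur_word = ". . ."        # the FactBank word
toks = [".", ".", "."]    # assumption: plain strings in place of spaCy tokens
toks_ind = 0
cur_tok = toks[toks_ind]

# The check used by align(): concatenating two tokens drops the space.
joined = cur_tok + str(toks[toks_ind + 1])
joined_in_word = joined in cur_word

# Comparing with whitespace removed from the word recognises the prefix.
prefix_matches = cur_word.replace(" ", "").startswith(joined)
```

Here `joined` is `'..'`, `joined_in_word` is `False`, and `prefix_matches` is `True`, which is why ignoring whitespace on the word side might be one way to make the check robust.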

This is the temporary/hacky solution I put in the method:

    def align(self, toks, sent):
        """
        Match between the spacy tokens in toks to the words in sent
        Might merge tokens in spacy in-place.
        """
        toks_ind = 0
        sent_ind = 0
        ret = []
        while sent_ind < len(sent):
 #           logging.debug("sent_ind = {}, toks_ind = {}".format(sent_ind, toks_ind))
            cur_tok = str(toks[toks_ind])
            cur_word = sent[sent_ind][1]
 #           logging.debug("{} vs. {}".format(cur_tok, cur_word))
 #           logging.debug("flag = {}".format(cur_word.endswith(cur_tok)))
            print "toks: ", toks #RR
            print "cur_tok: ", cur_tok #RR
            print "cur_word: ", cur_word #RR
            ### hacky bug fix next 3 lines ###
            if cur_tok == "." and cur_word == ". . .":
                toks[toks_ind : toks_ind + 3].merge()
                continue
            if (cur_tok == cur_word) or \
               (cur_word.endswith(cur_tok) and \
                (toks_ind >= (len(toks) -1) or ((cur_tok + str(toks[toks_ind + 1])) not in cur_word))):
                # rest of method...
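
A less hard-coded variant of the same idea might count how many consecutive tokens spell out the word once whitespace is ignored. The helper below is a sketch only: `tokens_to_merge` is a hypothetical name that does not exist in readers.py, and it works on plain token strings rather than spaCy tokens.

```python
def tokens_to_merge(tok_texts, start, word):
    """
    Return how many consecutive tokens, beginning at index start, spell out
    word once whitespace is ignored, or 0 if no run of tokens does.
    Hypothetical helper: it is not part of readers.py.
    """
    target = word.replace(" ", "")
    joined = ""
    for offset, text in enumerate(tok_texts[start:], 1):
        joined += text.replace(" ", "")
        if joined == target:
            return offset
        if not target.startswith(joined):
            # The tokens have diverged from the word; stop looking.
            break
    return 0


# Assumed token texts around the ellipsis in the failing sentence.
tok_texts = ["told", ".", ".", ".", "the"]
n = tokens_to_merge(tok_texts, 1, ". . .")
```

For the failing sentence `n` is 3, so the special case above could in principle become toks[toks_ind : toks_ind + n].merge() whenever n > 1, assuming str(tok) yields each token's text as it does elsewhere in the method.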
gabrielStanovsky commented 7 years ago

@rudinger, similar to #20, I think this may be caused by using a different FactBank version than the one we used. We had a lot of trouble aligning spaCy to external tokenization, so it makes sense that something like this would occur if our versions differ.

Is it possible for you to attach the sentences which fail with this error?

rudinger commented 7 years ago
'SJMN91-06338157.tml'|||11|||'"He was told . . . the handwriting was on the wall," the source said Monday.'
gabrielStanovsky commented 7 years ago

It seems that this error occurs when using a slightly different version of FactBank, which replaces some Uu labels with NA; the two labels appear to be semantically identical. Reverting to Uu labels solves the problem.