BLLIP / bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.
http://bllip.cs.brown.edu/

Head node not a direct child #56

Closed jofatmofn closed 7 years ago

jofatmofn commented 7 years ago

I have the following code:

constituency_string = str(rrp.parse_tagged(tokens, possible_tags=dict(enumerate(postags)))[0].ptb_parse)
tree = Tree(constituency_string)

For the sentence "An interesting date is four days from today.", the expected head (a direct child) and the actual head (a preterminal) returned by the tree object are shown below:

(S1                                     # Expected head: S; Got VBZ
    (S                                  # Expected head: VP; Got VBZ
        (NP                             # Head: NN
            (DT An) 
            (JJ interesting) 
            (NN date)) 
        (VP                             # Head: VBZ
            (VBZ is) 
            (NP                         # Expected head: NP; Got NNS
                (NP                     # Head: NNS
                    (CD four)
                    (NNS days)) 
                (PP                     # Head: IN
                    (IN from) 
                    (NP                 # Head: NN
                        (NN today)))))
        (. .)))

I am creating NAF output for a downstream coreference resolution module. I have written additional code to get the expected results. Is this a bug in bllipparser?

dmcc commented 7 years ago

Thanks for the report. BLLIP Parser calls them heads, but I think this is a bit of a misnomer and they're really closer to dependencies (in a governor-dependent sense). I'm afraid if you're looking for direct children, you'll need to extend the "head finder" or use the extracted dependencies and walk up the tree to find direct children.
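A minimal sketch of the "walk up the tree" idea, for illustration only. It assumes each node knows the token index of its lexical head and its own token span; the `Node` class and the helpers `leaf`, `phrase`, `assign_spans`, and `direct_head_child` are hypothetical stand-ins and not part of bllipparser. The direct head child is taken to be the child whose token span covers the lexical head.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node; not the bllipparser Tree class."""
    label: str
    children: List["Node"] = field(default_factory=list)
    token: Optional[str] = None  # set only on preterminals
    head_index: int = -1         # token index of the lexical head, as a parser would report it
    start: int = 0               # token span [start, end), filled in by assign_spans
    end: int = 0


def leaf(tag, word):
    return Node(tag, token=word)


def phrase(label, head_index, *children):
    return Node(label, list(children), head_index=head_index)


def assign_spans(node, start=0):
    """Fill in token spans recursively; return the index after the node's last token."""
    node.start = start
    if node.token is not None:
        node.end = start + 1
    else:
        pos = start
        for child in node.children:
            pos = assign_spans(child, pos)
        node.end = pos
    return node.end


def direct_head_child(node):
    """Return the direct child whose span covers the node's lexical head, or None."""
    for child in node.children:
        if child.start <= node.head_index < child.end:
            return child
    return None


# The sentence from this issue, with the lexical heads as reported in the tree above.
tree = phrase("S1", 3,
    phrase("S", 3,
        phrase("NP", 2, leaf("DT", "An"), leaf("JJ", "interesting"), leaf("NN", "date")),
        phrase("VP", 3,
            leaf("VBZ", "is"),
            phrase("NP", 5,
                phrase("NP", 5, leaf("CD", "four"), leaf("NNS", "days")),
                phrase("PP", 6, leaf("IN", "from"), phrase("NP", 7, leaf("NN", "today"))))),
        leaf(".", ".")))
assign_spans(tree)

s = tree.children[0]
print(direct_head_child(tree).label)  # S
print(direct_head_child(s).label)     # VP
```

Comparing token positions rather than the printed form of a subtree avoids confusing two subtrees that happen to print identically.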


jofatmofn commented 7 years ago

I am using pynaf to generate NAF output, and I need to call naf_document.add_constituency_tree. I have decided to use the extracted dependencies and walk up the tree. I am sharing the code in the hope that it is useful to someone.

def constituent_tree_to_naf(parent_node, parent_tid, is_parent_root):
    # Depth-first traversal of the tree.
    # This function is never called with a preterminal as parent_node, so every child_node is a nonterminal.
    # Nodes are compared by their string form, so two identical subtrees cannot be told apart.
    global tid, terminals, ntid, non_terminals, edgeid, edges, direct_head_less, edge_idx
    parent_head = str(parent_node.head())
    head_in_child = any(parent_head == str(child_node) for child_node in parent_node)
    # The root's edges are always marked as head below, so the root is not tracked here
    # (tracking it would append a second "yes" to the same edge).
    if not head_in_child and not is_parent_root:
        # Headless node: (node, index in edges of the edge to mark as head, True while the head is still to be found)
        direct_head_less.append((parent_node, None, True))

    for child_node in parent_node:

        # non_terminals: (constituent_id, constituent_label)
        ntid += 1
        non_terminals.append(("nter" + str(ntid), child_node.label))

        # If this child is the pending head of a headless ancestor, mark that ancestor's recorded edge as head.
        for i, dhl_t in enumerate(direct_head_less):
            if dhl_t[2] and str(dhl_t[0].head()) == str(child_node):
                edges[dhl_t[1]] += ("yes",)
                direct_head_less[i] = (dhl_t[0], dhl_t[1], False)

        # edges: (edge_id, from_id, to_id, head)
        edgeid += 1
        edge_idx += 1
        if is_parent_root or parent_head == str(child_node):
            edges.append(("tre" + str(edgeid), "nter" + str(ntid), "nter" + str(parent_tid), "yes"))
        else:
            edges.append(("tre" + str(edgeid), "nter" + str(ntid), "nter" + str(parent_tid)))

        if child_node.is_preterminal():
            # terminals: (constituent_id, [term_id])
            tid += 1
            terminals.append(("ter" + str(tid), ["t" + str(tid)]))  # TODO: check whether term_id can ever differ from the terminal id

            # edges: (edge_id, from_id, to_id, head)
            edgeid += 1
            edge_idx += 1
            edges.append(("tre" + str(edgeid), "ter" + str(tid), "nter" + str(ntid)))
        else:  # nonterminal, but not a preterminal
            # Record this child's edge as the one to mark if the parent's head turns up below it.
            for i, dhl_t in enumerate(direct_head_less):
                if dhl_t[2] and str(parent_node) == str(dhl_t[0]):
                    direct_head_less[i] = (dhl_t[0], edge_idx, dhl_t[2])

            constituent_tree_to_naf(child_node, ntid, False)

The calling method has this code (where tokens is the list of tokens and postags is the list of corresponding POS tags):

    global tid, terminals, ntid, non_terminals, edgeid, edges, direct_head_less, edge_idx
    tid = -1
    ntid = -1
    edgeid = -1

Then, for each sentence:

        terminals = []
        non_terminals = []
        edges = []
        direct_head_less = []
        edge_idx = -1
        ntid += 1
        non_terminals.append(("nter" + str(ntid), "ROOT"))
        constituency_lisp_string = str(rrp.parse_tagged(tokens, possible_tags=dict(enumerate(postags)))[0].ptb_parse)
        tree = Tree(constituency_lisp_string)
        head = tree.head()
        constituent_tree_to_naf(tree, ntid, True)
        naf_document.add_constituency_tree(non_terminals, terminals, edges)
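
For comparison, here is a rough sketch of the same conversion without module-level globals. It is not a drop-in replacement. The `Node` class and `to_naf_lists` are hypothetical and do not use the bllipparser or pynaf APIs. IDs restart at 0 on every call, and the root node itself is emitted as a nonterminal instead of a separate "ROOT" entry. Each node is assumed to carry the token index of its lexical head, and a child's edge is marked "yes" when the child's token span covers that index.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node (not the bllipparser Tree class)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    token: Optional[str] = None  # set only on preterminals
    head_index: int = -1         # token index of the lexical head; unused on preterminals


def to_naf_lists(root):
    """Return (non_terminals, terminals, edges) in the tuple shapes used above."""
    non_terminals, terminals, edges = [], [], []

    def add_edge(from_id, to_id, is_head):
        edge = ("tre%d" % len(edges), from_id, to_id)
        edges.append(edge + ("yes",) if is_head else edge)

    def visit(node, first_token):
        """Emit node and its subtree; return (node_id, index after the node's last token)."""
        node_id = "nter%d" % len(non_terminals)
        non_terminals.append((node_id, node.label))
        if node.token is not None:  # preterminal: attach exactly one terminal
            term_id = "ter%d" % len(terminals)
            terminals.append((term_id, ["t%d" % first_token]))
            add_edge(term_id, node_id, False)
            return node_id, first_token + 1
        pos = first_token
        for child in node.children:
            child_start = pos
            child_id, pos = visit(child, pos)
            # The head child is the one whose token span covers the lexical head.
            add_edge(child_id, node_id, child_start <= node.head_index < pos)
        return node_id, pos

    visit(root, 0)
    return non_terminals, terminals, edges


# Toy input: "dogs bark", with "bark" (token 1) as the head of S and VP.
root = Node("S", [
    Node("NP", [Node("NNS", token="dogs")], head_index=0),
    Node("VP", [Node("VBP", token="bark")], head_index=1),
], head_index=1)
non_terminals, terminals, edges = to_naf_lists(root)
```

Because every edge is appended only after the child's subtree has been visited, its head flag is known at that point, so no edge needs to be patched later.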