EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested Python dictionary.
MIT License
311 stars, 50 forks

Idea: Introduce helper TokenTree methods #72

Closed PROrock closed 4 months ago

PROrock commented 1 year ago

Hi, first of all, thanks for this awesome library, it is really helpful for my current project :-)

Second, for my current project I had to implement some helper methods for searching a tree and collecting the tokens of a subtree, and I want to share them with you. You can decide whether to incorporate some of them into the library (probably as instance methods on TokenTree). They might help others too.

Code:

from operator import itemgetter
from typing import Optional, List, Text
from conllu import TokenTree, TokenList, Token

def _get_node_by_id(tree_node: TokenTree, id_to_find: int) -> Optional[TokenTree]:
    to_traverse = [tree_node]
    while to_traverse:
        node = to_traverse.pop()
        if node.token["id"] == id_to_find:
            return node
        to_traverse.extend(node.children)
    return None

def _get_node_by_id_recursive(tree_node: TokenTree, id_to_find: int) -> Optional[TokenTree]:
    if tree_node.token["id"] == id_to_find:
        return tree_node

    # The walrus operator (:=) below requires Python 3.8+.
    for child_node in tree_node.children:
        if found_node := _get_node_by_id_recursive(child_node, id_to_find):
            return found_node

    return None

def _collect_all_subtree_tokens(tree_node: TokenTree) -> List[Token]:
    subtree_nodes = []
    to_traverse = [tree_node]
    while to_traverse:
        node = to_traverse.pop()
        subtree_nodes.append(node.token)
        to_traverse.extend(node.children)
    return subtree_nodes

# more general, could be used in the method get_word_subtree more or less instead of _collect_all_subtree_tokens
def to_list(root_node: TokenTree) -> TokenList:
    def flatten_tree(root_token: TokenTree, token_list: List[Token]) -> List[Token]:
        token_list.append(root_token.token)
        for child_token in root_token.children:
            flatten_tree(child_token, token_list)
        return token_list

    flatten_list = flatten_tree(root_node, [])

    flatten_list_by_id = sorted(flatten_list, key=itemgetter("id"))
    return TokenList(flatten_list_by_id, root_node.metadata)

def get_word_subtree(tree_node: TokenTree, token_id: int) -> Optional[Text]:
    word_node = _get_node_by_id(tree_node, token_id)
    if word_node is None:
        return None

    subtree_tokens = _collect_all_subtree_tokens(word_node)
    sorted_tokens_by_id = sorted(subtree_tokens, key=itemgetter("id"))
    return " ".join(token["form"] for token in sorted_tokens_by_id)
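
To illustrate what get_word_subtree returns, here is a minimal sketch that runs without the library. The Node class is a hypothetical stand-in that only mirrors the token and children attributes of TokenTree, and find_node and subtree_text are illustrative names, not part of conllu.

```python
from dataclasses import dataclass, field
from operator import itemgetter
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for conllu.TokenTree: a token dict plus children."""
    token: Dict[str, Any]
    children: List["Node"] = field(default_factory=list)


def find_node(root: Node, token_id: int) -> Optional[Node]:
    # Iterative depth-first search, the same idea as _get_node_by_id.
    stack = [root]
    while stack:
        node = stack.pop()
        if node.token["id"] == token_id:
            return node
        stack.extend(node.children)
    return None


def subtree_text(root: Node, token_id: int) -> Optional[str]:
    # Collect every token under the matching node, then restore word order.
    node = find_node(root, token_id)
    if node is None:
        return None
    tokens, stack = [], [node]
    while stack:
        current = stack.pop()
        tokens.append(current.token)
        stack.extend(current.children)
    return " ".join(t["form"] for t in sorted(tokens, key=itemgetter("id")))


# "The quick fox jumps": jumps (4) is the root, fox (3) depends on it,
# and The (1) and quick (2) depend on fox.
tree = Node({"id": 4, "form": "jumps"}, [
    Node({"id": 3, "form": "fox"}, [
        Node({"id": 1, "form": "The"}),
        Node({"id": 2, "form": "quick"}),
    ]),
])

print(subtree_text(tree, 3))  # The quick fox
print(subtree_text(tree, 4))  # The quick fox jumps
print(subtree_text(tree, 9))  # None
```

Sorting by id matters because the traversal visits nodes in tree order, not sentence order.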

EmilStenstrom commented 1 year ago

Hi, sorry for the very late reply here :)

Thanks for sharing your code snippets for traversing the trees. This is the first request I've received for these kinds of helpers, so I'm a bit hesitant to add them. I think I'll keep this issue open, and if more people think this is a good idea, they should make themselves heard in this thread! :)

peterr-s commented 1 year ago

+1: I have a use case for to_list() and have reimplemented it in my current project. I can see the utility of get_word_subtree() in the abstract as well.
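
For reference, one way such a to_list() could be written is as a recursive generator. This is only a sketch: Node is a hypothetical stand-in that mirrors the token, children and metadata attributes of TokenTree, and iter_tokens and to_sorted_tokens are illustrative names, not library functions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for conllu.TokenTree (token, children, metadata)."""
    token: Dict[str, Any]
    children: List["Node"] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def iter_tokens(node: Node) -> Iterator[Dict[str, Any]]:
    # Pre-order walk: yield this node's token, then each child's subtree.
    yield node.token
    for child in node.children:
        yield from iter_tokens(child)


def to_sorted_tokens(root: Node) -> List[Dict[str, Any]]:
    # Sorting by "id" restores sentence order regardless of tree shape.
    return sorted(iter_tokens(root), key=lambda token: token["id"])


root = Node(
    {"id": 2, "form": "sleep"},
    [Node({"id": 1, "form": "Cats"}), Node({"id": 3, "form": "soundly"})],
    metadata={"text": "Cats sleep soundly"},
)

forms = [token["form"] for token in to_sorted_tokens(root)]
print(forms)  # ['Cats', 'sleep', 'soundly']
```

With the real library, the sorted tokens and root.metadata would presumably be passed to TokenList, as in the to_list() posted above.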