danielhers / tupa

Transition-based UCCA Parser
https://danielhers.github.io/tupa
GNU General Public License v3.0
72 stars 24 forks

How to efficiently parse multiple pre-tokenized sentences? #95

Closed: feralvam closed this issue 4 years ago

feralvam commented 4 years ago

Hi,

I have a group of sentences that I want to parse. Each sentence should be considered as a single Passage. So far, I've been doing it with the following code:

def get_parser():
    if not UCCA_PARSER_PATH.parent.exists():
        download_ucca_model()
    update_ucca_path()
    with mock_sys_argv(['']):
        # Mock sys.argv, otherwise the parser tries to read it and raises an exception
        return Parser(str(UCCA_PARSER_PATH))

def ucca_parse_sents(sents: List[str]):
    passages = list(ucca.convert.from_text(sents, one_per_line=True))
    parser = get_parser()
    parsed_passages = [passage for (passage, *_) in parser.parse(passages, display=False)]
    return parsed_passages

This works fine. However, I did not realise that each sentence in sents is tokenized internally using spaCy (as far as I can tell). This is not desirable in my particular case, since all sents have already been tokenized. Is there a way to prevent the parser (or ucca.convert.from_text?) from tokenizing sents again?

I know there is a tokenized parameter in ucca.convert.from_text. However, it takes a single list of tokens, and so I would have to call ucca.convert.from_text once per sentence. Is there a more efficient way of achieving what I'm trying to accomplish?

Thanks!

danielhers commented 4 years ago

Hi @feralvam, tokenized=True is indeed what you should use. The current implementation of from_text is not parallel anyway, so calling it once per sentence is just as efficient. Assuming your tokens are separated by spaces, you could do

passages = []
for sent in sents:
    # tokenized=True uses the given token list as-is instead of re-tokenizing with spaCy
    passages += list(ucca.convert.from_text(sent.split(), tokenized=True))

to get the list of passages.
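Putting the two snippets in this thread together, the original ucca_parse_sents could be adapted along these lines. This is an untested sketch: it assumes the get_parser helper from the first comment, assumes tokens are separated by single spaces, and defers the ucca import so the splitting helper stands on its own.

```python
from typing import List


def split_pretokenized(sents: List[str]) -> List[List[str]]:
    """Whitespace-split each pre-tokenized sentence into its token list,
    ready to pass to ucca.convert.from_text(tokens, tokenized=True)."""
    return [sent.split() for sent in sents]


def ucca_parse_tokenized_sents(sents: List[str]):
    """Sketch: parse pre-tokenized sentences without spaCy re-tokenization.

    Assumes get_parser() as defined earlier in this thread; requires the
    ucca package to be installed.
    """
    import ucca.convert  # deferred so split_pretokenized stays dependency-free

    passages = []
    for tokens in split_pretokenized(sents):
        # tokenized=True takes one token list per call, so convert sentence by sentence
        passages += list(ucca.convert.from_text(tokens, tokenized=True))
    parser = get_parser()
    return [passage for (passage, *_) in parser.parse(passages, display=False)]
```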

feralvam commented 4 years ago

Awesome! Thanks for the quick reply.