BramVanroy / astred

An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. This is useful, for instance, for comparing a translation with the original text, for finding differences and similarities between two translations, or for seeing how a machine translation differs from a reference translation.
Apache License 2.0

Possible to batch input to automatic word alignment? #3

Open eelegiap opened 2 years ago

eelegiap commented 2 years ago

I've been using the fully-automated level of the tool. I have about 200 sentence pairs (Spanish/English) that I want to align, but it's taking forever because the language models are reloaded every time I run the alignment for a single sentence pair.

Is there a way to use the tool in a batched way, or to not load the language models over and over again during alignment? Thank you!

BramVanroy commented 2 years ago

Hello @eelegiap. Thanks for your interest!

Unfortunately, true batching is not available as a built-in. You can, however, batch-process your sentences on your own and then create Sentences from the resulting docs with Sentence.from_parser. Since you only have 200 sentences, though, I recommend loading the parsers separately so that they do not need to be reloaded every time. This works by passing a parser object instead of a language code to Sentence.from_text. The following should work.

from astred.aligned import AlignedSentences, Sentence
from astred.aligner import Aligner
from astred.utils import load_parser

nlp_en = load_parser("en", "stanza", is_tokenized=False, verbose=True)
nlp_es = load_parser("es", "stanza", is_tokenized=False, verbose=True)
aligner = Aligner()

your_data = [("This is a Spanish sentence.", "Esta es una oración en español."),
             ("Sorry, I do not speak Spanish", "Lo siento, no hablo español.")]

for sent_en_str, sent_es_str in your_data:
    sent_en = Sentence.from_text(sent_en_str, nlp_en)
    sent_es = Sentence.from_text(sent_es_str, nlp_es)
    aligned = AlignedSentences(sent_en, sent_es, aligner=aligner)
    # Do stuff
    for word in sent_en.no_null_words:
        print(word.text, [w.text for w in word.aligned if not w.is_null])

Please let me know if you encounter any other issues!

eelegiap commented 2 years ago

Thank you, @BramVanroy, I'll try it out! One more thing: on about 25% of my sentences, I've been getting an assertion error during the Stanza load: assert(int(word.head) == int(head.id)). I am pretty sure that the problem comes from line 155 of utils.py, during the Stanza initialization. I think that adding 'mwt' (multi-word tokens) to the processor pipeline should solve the problem! (from https://github.com/stanfordnlp/stanza/issues/272)

BramVanroy commented 2 years ago

That's a good catch! It's indeed because the processors are hardcoded. I vaguely remember that MWT would cause issues but I'd have to test. I'll try to have a look this weekend.
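
For illustration only, here is a minimal sketch of the kind of change suggested above: inserting the 'mwt' processor right after 'tokenize' in a Stanza processor string. The helper names with_mwt and build_pipeline are hypothetical and not part of astred, and the processor list "tokenize,pos,lemma,depparse" is an assumption about what is hardcoded, not taken from utils.py. Whether MWT plays well with the rest of astred is untested here, and not every Stanza language model ships an MWT processor.

```python
def with_mwt(processors: str) -> str:
    """Return a Stanza processor string with 'mwt' placed right after 'tokenize'.

    Hypothetical helper: the string is left unchanged if 'mwt' is already
    present or if there is no 'tokenize' step to attach it to.
    """
    names = [p.strip() for p in processors.split(",") if p.strip()]
    if "mwt" not in names and "tokenize" in names:
        names.insert(names.index("tokenize") + 1, "mwt")
    return ",".join(names)


def build_pipeline(lang: str, processors: str = "tokenize,pos,lemma,depparse"):
    """Build a Stanza pipeline with MWT expansion enabled (sketch).

    The default processor list is an assumption, not astred's actual setting.
    Stanza is imported lazily so that with_mwt can be used without it.
    """
    import stanza  # requires `pip install stanza` and downloaded models

    return stanza.Pipeline(lang=lang, processors=with_mwt(processors))


print(with_mwt("tokenize,pos,lemma,depparse"))
```

Calling build_pipeline("es") would then request MWT expansion for Spanish, provided the Spanish models have been downloaded with stanza.download("es").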