eelegiap opened this issue 2 years ago
Hello @eelegiap. Thanks for your interest!
Unfortunately, true batching is not available as a built-in. You can, however, do batch processing of your sentences on your own and then create `Sentence`s from the resulting docs with `Sentence.from_parser`. But since you are only using 200 sentence pairs, I instead recommend loading the parsers separately so that they do not need to be reloaded every time. This works by passing a parser object instead of a language code to `Sentence.from_text`. The following should work:
```python
from astred.aligned import AlignedSentences, Sentence
from astred.aligner import Aligner
from astred.utils import load_parser

# Load the parsers and aligner once, outside of the loop
nlp_en = load_parser("en", "stanza", is_tokenized=False, verbose=True)
nlp_es = load_parser("es", "stanza", is_tokenized=False, verbose=True)
aligner = Aligner()

your_data = [
    ("This is a Spanish sentence.", "Esta es una oración en español."),
    ("Sorry, I do not speak Spanish", "Lo siento, no hablo español."),
]

for sent_en_str, sent_es_str in your_data:
    sent_en = Sentence.from_text(sent_en_str, nlp_en)
    sent_es = Sentence.from_text(sent_es_str, nlp_es)
    aligned = AlignedSentences(sent_en, sent_es, aligner=aligner)

    # Do stuff
    for word in sent_en.no_null_words:
        print(word.text, [w.text for w in word.aligned if not w.is_null])
```
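If you do want to go the "batch on your own" route mentioned above, the first step is simply splitting your sentence pairs into groups before handing them to the parser. A minimal, stdlib-only chunking helper (hypothetical; not part of astred) could look like this:

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of at most n items from iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

# Example: split five sentence pairs into batches of two
pairs = [(f"en {i}", f"es {i}") for i in range(5)]
batch_sizes = [len(batch) for batch in batched(pairs, 2)]
print(batch_sizes)  # [2, 2, 1]
```

Each batch could then be parsed in one call and the resulting docs turned into `Sentence`s.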
Please let me know if you encounter any other issues!
Thank you, @BramVanroy, I'll try it out! One more thing: on about 25% of my sentences, I've been getting an assertion error during the Stanza load:

```
assert(int(word.head) == int(head.id))
```

I am pretty sure the problem is coming from Line 155 in the `utils.py` file, during the Stanza initialization. I think adding the `mwt` (multi-word token) processor to the processor pipeline should solve the problem! (from https://github.com/stanfordnlp/stanza/issues/272)
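Concretely, the suggestion amounts to including `mwt` in the processors string when the Stanza pipeline is built. I haven't checked the exact call in `utils.py`, so the snippet below is only a sketch of the intended change, not the library's actual code:

```python
import stanza

# Sketch only: the real Pipeline() call in astred's utils.py may differ.
# Before (no multi-word token expansion):
#     processors="tokenize,pos,lemma,depparse"
# After: mwt runs right after tokenize, so that multi-word tokens
# (common in Spanish, e.g. "del" -> "de" + "el") are expanded into words
# before dependency parsing assigns heads.
nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt,pos,lemma,depparse")
```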
That's a good catch! It's indeed because the processors are hardcoded. I vaguely remember that MWT would cause issues but I'd have to test. I'll try to have a look this weekend.
I've been using the fully-automated level of the tool. I have about 200 sentence pairs (Spanish/English) I want to align, but it's taking forever because the language models are reloaded every time I run the alignment for a single sentence pair.
Is there a way to use the tool in a batched way, or to not load the language models over and over again during alignment? Thank you!