Closed — feralvam closed this issue 4 years ago
Hi @feralvam,

`tokenized=True` is indeed what you should use. The current implementation of `from_text` is not parallel anyway, so it is just as efficient to call it multiple times. Assuming your tokens are separated by spaces, you could do

```python
passages = []
for sent in sents:
    passages += list(ucca.convert.from_text(sent.split(), tokenized=True))
```

to get the list of passages.
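As a side note, the reason `sent.split()` is safe here is that splitting on whitespace exactly recovers the original token list when the tokens were joined with single spaces, so nothing gets re-tokenized. A minimal sketch (the sentences below are hypothetical examples, not from the issue):

```python
# Hypothetical pre-tokenized sentences, stored as space-joined strings.
sents = ["John loves Mary .", "It is n't raining ."]

# Splitting on whitespace recovers the original token lists exactly,
# so no tokenizer (e.g. spaCy) needs to run on them again.
token_lists = [sent.split() for sent in sents]
print(token_lists[1])  # → ['It', 'is', "n't", 'raining', '.']
```

Each recovered token list is what gets passed to `from_text` with `tokenized=True` in the loop above.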
Awesome! Thanks for the quick reply.
Hi,
I have a group of sentences that I want to parse. Each sentence should be considered as a single Passage. So far, I've been doing it with the following code:
This works fine. However, I did not realise that each sentence in `sents` is tokenized internally using spaCy (as far as I can understand). This is not desirable behaviour in my particular case, since all `sents` have been previously tokenized. Is there a way to prevent the parser (or `ucca.convert.from_text`?) from tokenizing `sents` again?

I know there is a `tokenized` parameter in `ucca.convert.from_text`. However, it takes a single list of tokens, so I would have to call `ucca.convert.from_text` once per sentence. Is there a more efficient way of achieving what I'm trying to accomplish?

Thanks!