Text attribute of the sentence object is empty

sharpsy commented 2 years ago

Describe the bug

As the title says, the text property of the sentence object is always empty.

To Reproduce

import classla
import stanza

# Assumes the Slovenian models for both libraries are already downloaded.
testcase_sentence = "France Prešeren je rojen v Vrbi."

# Build tokenize-only pipelines with identical settings.
st_nlp = stanza.Pipeline(lang="sl", processors="tokenize")
cl_nlp = classla.Pipeline(lang="sl", processors="tokenize")

print(st_nlp(testcase_sentence).sentences[0].text)  # prints the sentence text, as expected
print(cl_nlp(testcase_sentence).sentences[0].text)  # expected the same output as above, but it is empty

Expected behavior

I would expect both sentence objects to contain the original text of the sentence. The empty attribute makes some workflows harder to implement. In my case, I would like to cross-reference the information received from classla (ner, lemma, upos) with the output of BERT-like models. Those models use different tokenizers that expect the raw text, and this issue makes it harder to get the sentence text when the document contains several sentences (i.e. not just a single one, as in the example above). There is a workaround of parsing the output of sentence._metadata, but I would like to avoid it if possible.
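For reference, the workaround mentioned above could look roughly like the sketch below. It assumes that sentence._metadata holds CoNLL-U style comment lines, including one of the form "# text = ..."; I have not confirmed that this is the exact format classla 1.1.0 produces. The helper name text_from_metadata and the sample string are hypothetical and only illustrate the parsing step.

```python
import re
from typing import Optional

# Matches a CoNLL-U style comment line such as "# text = Some sentence."
_TEXT_COMMENT = re.compile(r"^#\s*text\s*=\s*(.*)$")


def text_from_metadata(metadata: str) -> Optional[str]:
    """Return the value of the first '# text = ...' line, or None if absent.

    Hypothetical helper: assumes the metadata is a newline-separated
    string of CoNLL-U comment lines.
    """
    for line in metadata.splitlines():
        match = _TEXT_COMMENT.match(line.strip())
        if match:
            return match.group(1).strip()
    return None


# Made-up metadata string in the assumed format (not real classla output).
sample = "# sent_id = 1.1\n# text = France Prešeren je rojen v Vrbi.\n"
print(text_from_metadata(sample))
```

With a real pipeline, the same helper would be applied to each sentence's _metadata value. Because that is a private attribute, its format may change between releases, which is one reason to prefer a working text attribute.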

Environment (please complete the following information):

OS: Fedora 35
Python version: Python 3.10.4
Stanza version: 1.4.0
Classla version: 1.1.0

nljubesi commented 2 years ago

@lkrsnik Let's deal with this once you have classla back in your focus.

@sharpsy this might take a few weeks. We, of course, also do not mind pull requests. The underlying issue should be easy to fix.

lkrsnik commented 2 years ago

This was fixed with the latest classla release.

clarinsi / classla

Text attribute of the sentence object is empty #32