explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Token's idx in a lot of spaces text #21

Closed: AlexeySlvv closed this issue 4 years ago

AlexeySlvv commented 4 years ago

I'm parsing texts with the "en_ewt" model (the default StanfordNLP English model). My program looks like this:

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

# build the StanfordNLP pipeline and wrap it as a spaCy-compatible Language
pipe = stanfordnlp.Pipeline(lang='en')
sp = StanfordNLPLanguage(pipe)

text = 'In a hole in the ground there lived a hobbit.'
doc = sp(text)
for token in doc:
    print(token.text, token.idx)  # token text and its character offset in the text

For this text I want each token's text and idx, and the results are: In 0, a 3, hole 5, in 10, etc.

That's OK. Now I put several spaces in the text, e.g. "In(5 spaces)a(5 spaces)hole in the ground there lived a hobbit." (5 spaces between "In", "a" and "hole"). I expect the token idx values to be 0, 7, 13, 18, etc., but they are still 0, 3, 5, 10.

So as I understand it, the model collapses several spaces into one, no matter how many there really are. Is there any way to tell the Stanford models to count spaces as they are?

spaCy's native models (like en_core_web_sm) handle this case well, but I'd like to work with the Stanford models.
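For comparison, here's a quick check with a native pipeline (just a sketch; it assumes en_core_web_sm is installed, and note that spaCy turns runs of extra spaces into whitespace tokens, which is why they are filtered out below):

import spacy

nlp = spacy.load('en_core_web_sm')  # only the tokenizer matters for the offsets
text = 'In     a     hole in the ground there lived a hobbit.'
for token in nlp(text):
    if not token.is_space:  # skip the whitespace tokens spaCy keeps for extra spaces
        print(token.text, token.idx)  # In 0, a 7, hole 13, in 18, ...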

Thank you.

AlexeySlvv commented 4 years ago

GitHub ate my spaces in the second text example, so I edited it a little.

ines commented 4 years ago

So as I understand, the model counts several spaces as one space no matter how there really are. Is there any way to tell Stanford models to count spaces as is?

Yes, as far as I know the StanfordNLP models don't preserve the original input text. They use a neural network approach to tokenization, so the output may differ from the input (whitespace, contractions, etc.). That's just how the model works.

spaCy uses a non-destructive tokenization approach, so the output will always reflect the original text. You could try and get the output of both pipelines and then write a function to align the Stanford tokens to the spaCy tokens. But how well this works depends on the output and how easy it is to align the predicted tags to the non-destructive tokens.
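For instance, here's a rough sketch of that alignment idea (nothing official; it assumes each Stanford token's text still occurs verbatim, left to right, in the original string, and it reuses the sp pipeline from the first post):

import spacy

def align_tokens(text, stanford_doc, spacy_doc):
    # Locate each Stanford token in the original text with a moving cursor,
    # then pair it with the spaCy token that starts at the same character offset.
    starts = {t.idx: t for t in spacy_doc}
    pairs, cursor = [], 0
    for st in stanford_doc:
        start = text.find(st.text, cursor)
        if start == -1:
            continue  # the model rewrote this token's text, so there's no verbatim match
        cursor = start + len(st.text)
        pairs.append((st, starts.get(start)))  # second element is None where the tokenizations disagree
    return pairs

text = 'In     a     hole in the ground there lived a hobbit.'
pairs = align_tokens(text, sp(text), spacy.blank('en')(text))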

AlexeySlvv commented 4 years ago

Thank you again. I wrote a little post-processor to calculate token positions and build a list of spans, like span_tokenize from NLTK. Hope it helps someone else:

text = "In a hole in the ground there lived a hobbit."
doc = sp(text)
cursor, span_list = 0, []
for tt in (t.text for t in doc):
    start = text.find(tt, cursor)
    cursor = start+len(tt)
    span_list.append((start, cursor))
    print(tt, start, cursor)
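With the five-space version of the sentence, this should print the offsets expected above: In 0 2, a 7 8, hole 13 17, in 18 20, and so on. One caveat: the cursor-based substring search only works while each token's text actually appears, in order, in the original string, hence the guard for find returning -1.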