explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Change token text #1544

Closed Manslow closed 6 years ago

Manslow commented 6 years ago

When merging two tokens, the text attribute of the new token is tokena.text + ' ' + tokenb.text. I would much prefer the ability to specify a new text label, such as tokena.text + '-' + tokenb.text or perhaps simply 'merged'. I want it to be clear that the new token is a single concept (a single token) and not an aggregation of two concepts. This is useful when converting the Doc object to a string, where I would much rather do text replacement as part of merging tokens than as a post-processing step.
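For context, a minimal sketch of the current behaviour (this uses the modern retokenizer API and a blank English pipeline, so the exact calls may differ from what was available at the time):

```python
import spacy

# Blank pipeline: no model download needed, only the tokenizer runs.
nlp = spacy.blank("en")
doc = nlp("New York is big")

# Merge the first two tokens into one.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

# The merged token's text is the original span's text, whitespace included;
# there is no way to supply a replacement string such as "New-York" here.
print(doc[0].text)  # "New York"
print(len(doc))     # 3
```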

Another example of when this would be useful is when getting lemmas as part of a pipeline. Currently we have to extract the lemma strings from the document, leaving us with only a list of strings without any of the context the document provided about each lemma. Why not allow the 'text' attribute to become the lemma, so that we don't lose the document context? Worse, in a general pipeline one might want the document context back after lemmatisation, requiring a second parse of the document (with fewer features than were available before, due to the information lost in lemmatisation). For example, when removing stopwords after lemmatisation, you would either have to find a way to identify stopwords from the language model and compare your lemma strings against it, or re-parse the text into a Doc to use the is_stop attribute.
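The stopword point can be illustrated: as long as you keep the Doc around, lemma strings and attributes like is_stop stay together, with no re-parse needed. A sketch with a blank English pipeline (no lemmatizer is loaded, so lemma_ is empty and falls back to the raw text here):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; is_stop still works from language defaults
doc = nlp("the cat sat")

# Filter stopwords and take the lemma where one exists, without ever
# leaving the Doc for a bare list of strings.
kept = [tok.lemma_ or tok.text for tok in doc if not tok.is_stop]
print(kept)  # ['cat', 'sat']
```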

I suppose one reason for not allowing this is that it is an alteration of the text underlying the document. But surely it isn't difficult to update the underlying text and the various references to it after a simple merge? If it is, then maybe there is a need for a new Token attribute called something like 'transient', which can be set by the user and can be unrelated to the original text, essentially providing a memory of the lexical form produced by the last processing step in a pipeline. By that I mean that if you are passing Doc objects through a pipeline, you can use it to store the output of the previous text-processing step, so you would never have to re-parse a string into a Doc object. If you did want to parse a new document from just the transient strings, you could call doc.reparse(), and the transient string attributes would then become the backing text for the Doc and fill the text attribute.

We could of course do this ourselves with custom attributes but this seems like such a standard thing to want that perhaps it could be an explicit part of either the Token class or as part of the example usage documentation?

honnibal commented 6 years ago

The doc.text is built dynamically from unhashing the token.lex.orth attributes and inserting spacing based on the token attributes. So, there's a single source of truth --- there's no reconciliation problem.
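That reconstruction can be sketched in plain Python (an illustration of the idea, not spaCy's actual internals): each token stores an ID into a shared string store plus a trailing-whitespace flag, and the text is rebuilt on demand.

```python
# Hypothetical mini string store: orth IDs -> strings.
string_store = {101: "Hello", 102: ",", 103: "world"}

# Tokens as (orth_id, has_trailing_space) pairs.
tokens = [(101, False), (102, True), (103, False)]

def doc_text(tokens, store):
    """Rebuild the document text from token data: the single source of truth."""
    return "".join(store[orth] + (" " if space else "") for orth, space in tokens)

text = doc_text(tokens, string_store)
print(text)  # "Hello, world"
```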

I would use either .norm or .lemma for what you want. Then you can ask questions about token.norm_ or token.lemma_. You can also customize token.lower_ if you want.
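For example, .norm_ can be overwritten per token from Python while .text stays untouched (a sketch using a blank pipeline; requires spaCy 2+):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("NY is big")

# Store the preferred surface form in norm_; the underlying text is unchanged.
doc[0].norm_ = "New York"

print(doc[0].norm_)  # "New York"
print(doc[0].text)   # "NY"
print(doc.text)      # "NY is big"
```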

I think it's better for predictability if the .text remains unchanged. This is a matter of taste though. There's no technical reason why we couldn't support setting a different text value to the merged span.

Manslow commented 6 years ago

I was using .lemma as a hack, but I'd prefer that attribute to keep its original semantics: in some contexts it would be a lemma, and in others (the 'transient' case I mentioned above) a data holder with unspecified, user-defined semantics.

Separately to the issue, I'd also like to say that you are a force of nature! I've never seen such fast and helpful responses. Very impressed.

Manslow commented 6 years ago

I managed to achieve this with the new extension capability of spaCy 2 by creating a 'transient' token extension:

from spacy.tokens import Token

Token.set_extension('transient', default='')

Instantiating the transient attribute:

for tok in doc:
    tok._.transient = tok.text

Then altering this transient variable in my pipeline. I think this is a suitable solution for my purposes but I still wonder if it is common enough a requirement to merit built-in support.
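Put together, a self-contained version of that workaround might look like this (a sketch; 'transient' is the custom attribute name chosen above, not a spaCy built-in, and the lemmatisation step is simulated by hand):

```python
import spacy
from spacy.tokens import Token

# Register the custom attribute once per process.
if not Token.has_extension("transient"):
    Token.set_extension("transient", default="")

nlp = spacy.blank("en")
doc = nlp("the cats sat")

# Initialise from the surface text, then let pipeline steps rewrite it.
for tok in doc:
    tok._.transient = tok.text

doc[1]._.transient = "cat"  # e.g. the output of a lemmatisation step

# Rebuild a string from the transient forms without touching doc.text.
output = " ".join(tok._.transient for tok in doc)
print(output)    # "the cat sat"
print(doc.text)  # unchanged: "the cats sat"
```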

honnibal commented 6 years ago

I think with custom pipelines, it's valuable to give developers some invariants they can rely on. Knowing that nothing that runs before you in the pipeline will change the text is a pretty important guarantee. Without this you have to be much more paranoid about whether your method will be correct.

Developers can still change the text --- they just have to write a Cython extension to do so. I think that's a reasonable balance: if someone jumps the safety barriers and really insists on editing the underlying text, they can. But otherwise, from Python, the source text stays immutable. I'll therefore close the issue.

Thanks!

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.