The `doc.text` is built dynamically from unhashing the `token.lex.orth` attributes and inserting spacing based on the token attributes. So there's a single source of truth: there's no reconciliation problem.
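To illustrate (a minimal sketch, assuming spaCy v2 and the small English model installed):

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The text is rebuilt from the tokens.")

# Each token stores its form (orth) plus its trailing whitespace;
# doc.text is just their concatenation, so there's nothing to reconcile.
rebuilt = ''.join(tok.text_with_ws for tok in doc)
assert rebuilt == doc.text
```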
I would use either `.norm` or `.lemma` for what you want. Then you can ask questions about `token.norm_` or `token.lemma_`. You can also customize `token.lower_` if you want.
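A minimal sketch of that, assuming a spaCy v2 release in which `Token.norm_` is writable:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"I was wondering abt this")

tok = doc[3]           # the token "abt"
tok.norm_ = u"about"   # store a normalized form; .text is untouched

print(tok.text)    # 'abt'
print(tok.norm_)   # 'about'
print(tok.lemma_)  # whatever the lemmatizer assigned
```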
I think it's better for predictability if the `.text` remains unchanged. This is a matter of taste, though; there's no technical reason why we couldn't support setting a different text value on the merged span.
I was using `.lemma` as a hack, but I'd prefer that attribute to keep its original semantics; otherwise, in some contexts it would be a lemma and in others (the 'transient' case I mentioned) a data holder with unspecified, user-created semantics.
Separately from the issue itself, I'd also like to say that you are a force of nature! I've never seen such fast and helpful responses. Very impressed.
I managed to achieve this with the new extensions capability of spaCy 2 by creating a 'transient' token extension:

```python
from spacy.tokens import Token

Token.set_extension('transient', default='')
```

and then initializing the transient attribute:

```python
for tok in doc:
    tok._.transient = tok.text
```
Then I alter this transient attribute in my pipeline. I think this is a suitable solution for my purposes, but I still wonder whether this requirement is common enough to merit built-in support.
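For what it's worth, here's a minimal sketch of how that fits together as a v2 pipeline; the lowercasing component is just a hypothetical stand-in for whatever processing rewrites the transient form:

```python
import spacy
from spacy.tokens import Token

Token.set_extension('transient', default='')

def init_transient(doc):
    # Seed each token's transient form from the original text.
    for tok in doc:
        tok._.transient = tok.text
    return doc

def lowercase_transient(doc):
    # Later components read and rewrite the transient form only;
    # doc.text itself is never touched.
    for tok in doc:
        tok._.transient = tok._.transient.lower()
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(init_transient, name='init_transient')
nlp.add_pipe(lowercase_transient, name='lowercase_transient')

doc = nlp(u"The Source Text Stays Immutable")
print([tok._.transient for tok in doc])  # lowercased transient forms
print(doc.text)                          # original text, unchanged
```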
I think with custom pipelines, it's valuable to give developers some invariants they can rely on. Knowing that nothing that runs before you in the pipeline will change the text is a pretty important guarantee. Without this you have to be much more paranoid about whether your method will be correct.
Extensions can still change the text; they just have to write a Cython extension to do so. I think that's a reasonable balance: if someone jumps the safety barriers and really insists on editing the underlying text, they can. But otherwise, from Python, the source text stays immutable. I'll therefore close the issue.
Thanks!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
When merging two tokens, the text attribute of the new token is `tokena.text + ' ' + tokenb.text`. I would much prefer the ability to specify a new text label, such as `tokena.text + '-' + tokenb.text`, or perhaps simply `'merged'`. I want it to be clear that the new token is a single concept (a single token) and not an aggregation of two concepts. This is useful when converting the Doc object to a string, where I would much rather do text replacement as part of merging tokens than as a post-processing step.

Another example of when this would be useful is when getting lemmas as part of a pipeline. Currently we have to extract the lemma strings from the document, leaving us with only a list of strings without any of the context the document provided about the lemma. Why not allow the 'text' attribute to become the lemma, so that we don't have to lose the document context? Worse, in a general pipeline, after lemmatisation one might want to get the document context again, requiring a second parsing of the document (with fewer features than were available before, due to the loss of information from lemmatisation). For example, when removing stopwords after lemmatisation, you would either have to implement a way to identify stopwords from the language model and compare your lemma strings against it, or re-parse the text into a Doc to use the is_stop attribute.
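For illustration, a minimal sketch of the merge behaviour being discussed, assuming a spaCy v2 release with the `doc.retokenize()` API (v2.0.11+): the attrs dict can override attributes such as LEMMA on the merged token, while `.text` stays derived from the source.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"I live in New York")

# Merge "New York" into one token. attrs can override LEMMA, NORM,
# etc., but .text remains derived from the underlying source text.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={'LEMMA': u'new-york'})

print(doc[3].text)    # 'New York'
print(doc[3].lemma_)  # 'new-york'
```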
I suppose one reason for not allowing this is that it is an alteration of the text underlying the document. But surely it isn't difficult to update the underlying text and the various references to it after a simple merge? If it is, then maybe there is need of a new Token attribute, called something like 'transient', which can be set by the user and can be unrelated to the original text, essentially remembering the lexical form produced by the last processing step in a pipeline. By that I mean that if you are passing Doc objects through a pipeline, you can use it to store the output of the previous text-processing step. You would then never have to re-parse a string into a Doc object. If you did want to parse a new document from just the transient strings, you could call doc.reparse(), and the transient string attributes would then become the backing text for the Doc and fill the text attribute.
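To make the reparse idea concrete, here's a rough user-space sketch; `doc.reparse()` doesn't exist, so the `reparse` helper below is hypothetical, and the whitespace handling is deliberately simplistic:

```python
import spacy
from spacy.tokens import Token

Token.set_extension('transient', default='')

def reparse(nlp, doc):
    # Hypothetical stand-in for the proposed doc.reparse(): build a new
    # backing text from the transient forms and process it from scratch.
    new_text = ''.join(tok._.transient + tok.whitespace_ for tok in doc)
    return nlp(new_text)

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The cats were running")
for tok in doc:
    tok._.transient = tok.lemma_  # e.g. carry the lemmas forward

new_doc = reparse(nlp, doc)
print(new_doc.text)  # the lemmatized text, now the backing text
```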
We could of course do this ourselves with custom attributes, but this seems like such a standard thing to want that perhaps it could be an explicit part of either the Token class or the example usage documentation?