Feature description

(I posted this as a how-to question on Stack Overflow, but since nobody replied there, I assume it is not possible yet.)

I'd really like the Doc.to_array method to include the custom extension I set on the Token class (see the example further down). If that were a trivial but maybe tedious task to implement, I'd be happy to help.
I didn't understand exactly. Can you give an example input/output pair?
Ah, sorry for being cryptic. My workflow contains a custom extension (for lemmatized words).
from spacy.tokens import Token

Token.set_extension("custom_lemma", default="", force=True)
# the extension is filled in during parsing by a custom lemmatizer function called 'lemmatizer'
nlp.add_pipe(lemmatizer)
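For context, a minimal sketch of what that lemmatizer component looks like, roughly (the lookup table here is a made-up stand-in for the real lemmatization logic):

def lemmatizer(doc):
    # made-up lookup standing in for the actual lemmatization logic
    lookup = {"bin": "sein", "müde": "müde"}
    for token in doc:
        token._.custom_lemma = lookup.get(token.lower_, token.text)
    return doc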
Now, I'd like to go from here:
from spacy.attrs import ORTH, POS
doc = nlp("Ich bin müde.")
doc.to_array([ORTH, POS])
array([[ 3126701799202552694, 94],
[ 8404852791380219477, 86],
[10386993667692466914, 83],
[12646065887601541794, 96]], dtype=uint64)
to:
from spacy.attrs import ORTH, POS, custom_lemma # magically retrieving my custom extension here
doc = nlp("Ich bin müde.")
doc.to_array([ORTH, POS, custom_lemma])
array([[ 3126701799202552694, 94, 345345], # the last column's numbers are made up ;-)
[ 8404852791380219477, 86, 23523],
[10386993667692466914, 83, 234],
[12646065887601541794, 96, 235354]], dtype=uint64)
@Jean-Zombie There's not really a way we could provide that feature, because there's no type-constraint on what you could set into the custom attributes.
The data backing the user-defined attributes is saved into the doc.user_data dictionary, so you could serialize that with pickle (or, if you limit yourself to plain types, JSON or msgpack).
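For instance, roughly like this (assuming the custom_lemma extension from above):

import pickle

# Token extension values live in doc.user_data, keyed on tuples of
# roughly the form ("._.", attr_name, start_char, end_char).
with open("user_data.pkl", "wb") as f:
    pickle.dump(doc.user_data, f)

# Later, after re-parsing the exact same text (so the character
# offsets still match), restore the extension values:
with open("user_data.pkl", "rb") as f:
    doc.user_data.update(pickle.load(f))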
You could also consider having your extension write to the built-in .lemma attribute.
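Something like this sketch (the lookup table is hypothetical):

def lemmatizer(doc):
    lookup = {"bin": "sein"}  # hypothetical lookup table
    for token in doc:
        # write to the built-in attribute instead of a custom extension
        token.lemma_ = lookup.get(token.lower_, token.text)
    return doc

nlp.add_pipe(lemmatizer)

# Because LEMMA is a built-in attribute, it round-trips through to_array:
from spacy.attrs import ORTH, POS, LEMMA
doc = nlp("Ich bin müde.")
arr = doc.to_array([ORTH, POS, LEMMA])

The lemma column then holds hash values, which you can resolve back to strings via doc.vocab.strings[value].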
there's no type-constraint on what you could set into the custom attributes.
Right, I understand why that could cause trouble.
You could also consider having your extension write to the built-in .lemma attribute.
Nice! That option didn't even cross my mind. Will check it out. Much appreciated.
On the same issue, about the structure of Doc.user_data: as far as I've seen, it's a dictionary whose values are the extension values. However, I don't understand the key: it's a 4-tuple whose third component is the character offset of the token owning the extension. Why is that, and not the token index? In addition, can I trust that this structure will not change?
Also, how can the .lemma attribute be used for that?
(A little background: I'm interested in deleting tokens from a document. I've seen examples that use Doc.to_array for that, but since that method does not handle extensions, I need a way to copy them as well; a rough sketch of what I mean follows. This is how I found this issue.)
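Something like this sketch is what I have in mind (it assumes the custom_lemma extension from above; the POS column is just an example attribute):

import numpy
from spacy.attrs import POS
from spacy.tokens import Doc

def remove_tokens(doc, drop):
    """Return a copy of doc without the token indices in drop,
    carrying over POS tags and the custom_lemma extension."""
    keep = [tok for i, tok in enumerate(doc) if i not in drop]
    arr = numpy.delete(doc.to_array([POS]), sorted(drop), axis=0)
    new_doc = Doc(doc.vocab,
                  words=[tok.text for tok in keep],
                  spaces=[bool(tok.whitespace_) for tok in keep])
    new_doc.from_array([POS], arr)
    # to_array()/from_array() don't know about extensions, so copy
    # them token by token via the underscore API:
    for old_tok, new_tok in zip(keep, new_doc):
        new_tok._.custom_lemma = old_tok._.custom_lemma
    return new_doc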
However, I don't understand the key: it's a 4-tuple whose third component is the character offset of the token owning the extension. Why is that, and not the token index?
The tokenization can change, but the character offsets can't. If we keyed on the token index, things would break when you used doc.retokenize() to split or merge tokens.
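A quick illustration (assuming the custom_lemma extension from above):

doc = nlp("Ich bin müde.")
doc[2]._.custom_lemma = "müde"   # "müde" starts at character offset 8
# doc.user_data now contains a key of roughly the form
# ("._.", "custom_lemma", 8, None)
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])  # merge "Ich bin": token indices shift
print(doc[1]._.custom_lemma)     # character offset 8 still finds "müde"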
In addition, can I trust that this structure will not change?
It's an implementation detail, but I'd say it's quite stable, yes. We wouldn't change it lightly.
We just released spaCy v2.2 with the new DocBin class for efficient binary serialization of Doc objects!
Details: https://spacy.io/usage/saving-loading#docs API: https://spacy.io/api/docbin
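A quick sketch of the workflow (note the store_user_data flag, which also serializes the custom extension data in doc.user_data):

from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["ORTH", "POS"], store_user_data=True)
for doc in nlp.pipe(["Ich bin müde.", "Du auch?"]):
    doc_bin.add(doc)
data = doc_bin.to_bytes()  # one bytestring for all docs

# Later, possibly in another process:
docs = list(DocBin().from_bytes(data).get_docs(nlp.vocab))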