explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Doc.to_array and custom extension #2532

Closed Jean-Zombie closed 5 years ago

Jean-Zombie commented 6 years ago

Feature description

(I posted this as a how-to question on Stack Overflow, but since nobody replied there, I assume it is not possible yet.)

I'd really like the Doc.to_array method to include the custom extension I set on the Token class, like this:

from spacy.attrs import CUSTOM_EXTENSION, POS, TAG
doc.to_array([CUSTOM_EXTENSION, POS, TAG])

If that would be a trivial but maybe tedious task to implement, I'd be happy to help.

DuyguA commented 6 years ago

I didn't quite understand. Can you give an example input/output pair?

Jean-Zombie commented 6 years ago

Ah, sorry for being cryptic. My workflow contains a custom extension (for lemmatized words).

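from spacy.tokens import Token

# Register a token-level extension; force=True overwrites an existing registration.
Token.set_extension("custom_lemma", default="", force=True)
# The extension is set during parsing by a custom pipeline component called 'lemmatizer'.
nlp.add_pipe(lemmatizer)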

Now, I'd like to go from here:

from spacy.attrs import ORTH, POS

doc = nlp("Ich bin müde.")
doc.to_array([ORTH, POS])
array([[ 3126701799202552694,    94],
       [ 8404852791380219477,    86],
       [10386993667692466914,    83],
       [12646065887601541794,    96]], dtype=uint64)

to:

from spacy.attrs import ORTH, POS, custom_lemma # magically retrieving my custom extension here

doc = nlp("Ich bin müde.")
doc.to_array([ORTH, POS, custom_lemma])
array([[ 3126701799202552694,    94,    345345], # the last column's numbers are made up ;-)
       [ 8404852791380219477,    86,    23523],
       [10386993667692466914,    83,    234],
       [12646065887601541794,    96,    235354]], dtype=uint64)

honnibal commented 6 years ago

@Jean-Zombie There's not really a way we could provide that feature, because there's no type-constraint on what you could set into the custom attributes.

The data backing the user-defined attributes is saved to the doc.user_data dictionary, so you could serialize that with pickle (or, if you limit yourself to plain types, with json, msgpack or something similar).
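
As a rough sketch of that serialization choice, the snippet below uses a plain dictionary as a stand-in for doc.user_data, so it runs without spaCy. The key layout (a 4-tuple of prefix, extension name, start character offset and an end slot) is an assumption based on the description later in this thread, and to_jsonable / from_jsonable are hypothetical helper names. It illustrates that pickle round-trips tuple keys directly, whereas JSON needs the keys moved out of the object-key position first.

```python
import json
import pickle

# Stand-in for doc.user_data after parsing "Ich bin müde."
# Assumed key layout: (prefix, extension name, start character offset, end slot).
user_data = {
    ("._.", "custom_lemma", 0, None): "ich",
    ("._.", "custom_lemma", 4, None): "sein",
    ("._.", "custom_lemma", 8, None): "müde",
    ("._.", "custom_lemma", 12, None): ".",
}

# pickle handles tuple keys (and arbitrary Python values) as they are.
restored_pickle = pickle.loads(pickle.dumps(user_data))


def to_jsonable(data):
    """Hypothetical helper: store entries as [key, value] pairs,
    because JSON object keys cannot be tuples."""
    return [[list(key), value] for key, value in data.items()]


def from_jsonable(pairs):
    """Hypothetical helper: rebuild the tuple keys from the pair list."""
    return {tuple(key): value for key, value in pairs}


restored_json = from_jsonable(json.loads(json.dumps(to_jsonable(user_data))))
```

The JSON route only works while every extension value is itself a plain type (strings, numbers, lists, dicts), which is the restriction mentioned above.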

You could also consider having your extension write to the built-in .lemma attribute.
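
A minimal sketch of that second option, assuming a blank German pipeline (so no trained model is needed) and a hypothetical custom_lemmas lookup table standing in for the custom lemmatizer. Because the built-in lemma is stored as a 64-bit string hash, it fits into the array returned by Doc.to_array, and the hash can be resolved back to text through the vocab's string store.

```python
import spacy
from spacy.attrs import LEMMA, ORTH

# Assumption: a blank German pipeline is enough for this illustration.
nlp = spacy.blank("de")
doc = nlp("Ich bin müde.")

# Hypothetical lookup table standing in for a real custom lemmatizer.
custom_lemmas = {"bin": "sein"}

for token in doc:
    # Write to the built-in attribute instead of token._.custom_lemma.
    token.lemma_ = custom_lemmas.get(token.text, token.text.lower())

# The lemma column now sits next to ORTH in the exported array.
arr = doc.to_array([ORTH, LEMMA])

# Resolve the hashes in the lemma column back to strings.
lemmas = [nlp.vocab.strings[int(h)] for h in arr[:, 1]]
```

Note that any pipeline component that also sets lemmas would compete with these values, so the order of components matters.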

Jean-Zombie commented 6 years ago

there's no type-constraint on what you could set into the custom attributes.

Right, I understand why that could cause trouble.

You could also consider having your extension write to the built-in .lemma attribute.

Nice! That option didn't even cross my mind. Will check it out. Much appreciated.

yarongon commented 5 years ago

On the same issue, about the Doc.user_data structure: as far as I've seen, it's a dictionary where the value is the value of the extension. However, I don't understand the key: it's a 4-tuple, where the 3rd component is the character offset of the token owning the extension. Why is it so, and not the token index? In addition, can I trust that this structure will not change?

Also, how can the .lemma attribute be used for that?

(A little background: I'm interested in deleting tokens from a document. I've seen examples using Doc.to_array for that, but since this method does not handle extensions, I need a way to copy them as well. This is how I found this issue.)

honnibal commented 5 years ago

However, I don't understand the key: it's a 4-tuple, where the 3rd component is the character offset of the token owning the extension. Why is it so, and not the token index?

The tokenization can change, but the character offsets can't. If we used the token index, things would break when you used doc.retokenize() to split or merge tokens.
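
To make that concrete, here is a toy model in plain Python. It is not spaCy's implementation: the token lists, the start_offsets helper and the two dictionaries are all made up for illustration. It stores a value for "müde" once under its token index and once under its start character offset, then merges the first two tokens the way a retokenizer might.

```python
text = "Ich bin müde."


def start_offsets(tokens, text):
    """Return the start character offset of each token, in order."""
    offsets, pos = [], 0
    for tok in tokens:
        pos = text.index(tok, pos)
        offsets.append(pos)
        pos += len(tok)
    return offsets


tokens = ["Ich", "bin", "müde", "."]
starts = start_offsets(tokens, text)

# Attach a value to "müde" under both keying schemes.
by_index = {2: "custom value for müde"}
by_offset = {starts[2]: "custom value for müde"}

# Merge "Ich" and "bin" into a single token.
merged = ["Ich bin", "müde", "."]
merged_starts = start_offsets(merged, text)

# The index key now points at a different token ...
token_under_index_key = merged[2]
# ... while the offset key still finds the original one.
offset_key = next(iter(by_offset))
token_under_offset_key = merged[merged_starts.index(offset_key)]
```

After the merge, the index-based key refers to the final "." while the offset-based key still resolves to "müde", which is the breakage described above.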

In addition, can I trust that this structure will not change?

It's an implementation detail, but I'd say it's quite stable, yes. We wouldn't change it lightly.

ines commented 5 years ago

We just released spaCy v2.2 with the new DocBin class for efficient binary serialization of Doc objects!

Details: https://spacy.io/usage/saving-loading#docs
API: https://spacy.io/api/docbin

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.