Open BramVanroy opened 3 years ago
Sorry for not getting back to you about this sooner. I think my main concern would be that it sounds like it's going to be relatively hard to use this annotation from a spaCy `Doc`. I haven't looked into how they store the constituency trees in detail, but using plain stanza with its original data structures sounds like it might be better from a usability perspective? What do you think are the advantages of having this in a spaCy `Doc`?
A tree is an iterable of subtrees with ultimately Words as terminals and linguistic categories as intermediate nodes. From that perspective, I was thinking of having a similar Tree structure in the spacy_stanza API that used spaCy Tokens instead. You'd still be able to traverse the constituency tree as per stanza API but the terminals that you get out of it are spaCy tokens. I might be biased, but this would be useful in my own research where I want to use constituency trees on the one hand as well as spaCy's extensibility for my own components.
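To make the idea concrete, here is a minimal sketch of such a tree whose terminals carry token objects. It does not depend on stanza or spaCy: `TokenTree` and `attach_tokens` are hypothetical names that exist in neither library, plain strings stand in for spaCy Tokens, and the sketch assumes that leaves and tokens align one-to-one in left-to-right order.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Sequence


@dataclass
class TokenTree:
    """Hypothetical constituency node; not part of stanza or spacy_stanza."""

    label: str
    children: List["TokenTree"] = field(default_factory=list)
    token: Any = None  # set on leaves only; would hold a spaCy Token

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> Iterator["TokenTree"]:
        """Yield the terminal nodes in left-to-right order."""
        if self.is_leaf():
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


def attach_tokens(tree: TokenTree, tokens: Sequence[Any]) -> TokenTree:
    """Attach one token per leaf, assuming a 1:1 left-to-right alignment."""
    leaves = list(tree.leaves())
    if len(leaves) != len(tokens):
        raise ValueError(f"{len(leaves)} leaves but {len(tokens)} tokens")
    for leaf, token in zip(leaves, tokens):
        leaf.token = token
    return tree


# (S (NP (PRP I)) (VP (VBP like) (NP (NNS trees))))
tree = TokenTree("S", [
    TokenTree("NP", [TokenTree("PRP", [TokenTree("I")])]),
    TokenTree("VP", [
        TokenTree("VBP", [TokenTree("like")]),
        TokenTree("NP", [TokenTree("NNS", [TokenTree("trees")])]),
    ]),
])

tokens = ["I", "like", "trees"]  # stand-ins for spaCy Token objects
attach_tokens(tree, tokens)
terminal_tokens = [leaf.token for leaf in tree.leaves()]
```

Traversal works the same way as on any nested tree, but each terminal now gives direct access to the token object, and through it to whatever custom attributes other pipeline components have set.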
As always, if this does not seem useful for the wider user-base to you, then we can close this topic.
Since release v1.3.0, stanza has a constituency parser for English. Support for more languages will follow. It would be great if we could access the constituency parse from within the spaCy wrapper too.
At first I thought I'd create a separate package for this that uses spacy_stanza under the hood and then registers a custom component that adds the constituency parse. However, that implies either copying most of `spacy_stanza`, subclassing `StanzaTokenizer`, and writing to Underscore objects in `__call__`, or only creating a custom component that, after receiving a Doc, parses its text again with stanza to get the constituency parse. Neither of these is ideal, so I would hope that you are open to incorporating such functionality in spacy_stanza directly.
The stanza constituency parser adds a `constituency` object (a `Tree`) to every sentence. Things that may be considered:
- `._.constituency` for every sentence span and every span that is a constituent (a sub-classed stanza Tree)
- `._.constituency` for every Token, which would be its subtree in the full tree with itself as the node (a sub-classed stanza Tree)
If you agree I can work on this from time to time.
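As a rough illustration of the per-span idea, the sketch below walks a constituency tree and collects token offsets for every constituent. It is an assumption-laden sketch rather than the proposed implementation: the tree is a plain nested tuple instead of a stanza Tree, `constituent_spans` is a hypothetical helper, and every leaf is assumed to correspond to exactly one token.

```python
from typing import List, Tuple, Union

# A node is either a leaf word (str) or a tuple (label, child, child, ...).
Node = Union[str, tuple]
LabelledSpan = Tuple[str, int, int]  # (label, start, end), end exclusive


def constituent_spans(node: Node, start: int = 0) -> Tuple[int, List[LabelledSpan]]:
    """Return (end_offset, spans) for the subtree rooted at `node`.

    Spans are listed in pre-order; offsets count tokens, not characters.
    """
    if isinstance(node, str):
        # A leaf covers exactly one token and is not itself a constituent.
        return start + 1, []
    label, *children = node
    spans: List[LabelledSpan] = []
    end = start
    for child in children:
        end, child_spans = constituent_spans(child, end)
        spans.extend(child_spans)
    spans.insert(0, (label, start, end))
    return end, spans


# (S (NP (PRP I)) (VP (VBP like) (NP (NNS trees))))
tree = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBP", "like"), ("NP", ("NNS", "trees"))))

end, spans = constituent_spans(tree)
```

With offsets like these, each `(label, start, end)` triple could be turned into a spaCy span by slicing the sentence's tokens, and the subtree stored on a custom extension attribute registered through spaCy's `Span.set_extension`.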