explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Add stanza constituency output #78

Open BramVanroy opened 2 years ago

BramVanroy commented 2 years ago

Since release v1.3.0, stanza has a constituency parser for English. Support for more languages will follow. It would be great if we could access the constituency parse from within the spaCy wrapper too.

At first I thought I'd create a separate package for this that uses spacy_stanza under the hood and registers a custom component that adds the constituency parse. However, that implies either copying most of spacy_stanza, subclassing StanzaTokenizer, and writing to Underscore objects in __call__, or creating only a custom component that, after receiving a Doc, parses its text again with stanza to get the constituency parse. Neither of these is ideal, so I hope that you are open to incorporating such functionality in spacy_stanza directly.

The stanza constituency parser adds a constituency object (a Tree) to every sentence. Things that may be considered

If you agree I can work on this from time to time.

adrianeboyd commented 2 years ago

Sorry for not getting back to you about this sooner. My main concern is that it sounds like it's going to be relatively hard to use this annotation from a spaCy Doc. I haven't looked in detail into how they store the constituency trees, but using plain stanza with its original data structures sounds like it might be better from a usability perspective? What do you think are the advantages of having this in a spaCy Doc?

BramVanroy commented 2 years ago

A Tree is an iterable of subtrees, with Words as terminals and linguistic categories as intermediate nodes. From that perspective, I was thinking of having a similar Tree structure in the spacy_stanza API that uses spaCy Tokens instead. You'd still be able to traverse the constituency tree as per the stanza API, but the terminals that you get out of it would be spaCy Tokens. I might be biased, but this would be useful in my own research, where I want to use constituency trees on the one hand and spaCy's extensibility for my own components on the other.

As always, if this does not seem useful for the wider user base to you, then we can close this topic.
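
To make the token-backed tree idea a bit more concrete, here is a rough, standard-library-only sketch. Everything in it is hypothetical and not part of spacy_stanza or stanza: the TokenTree class, parse_brackets and attach_tokens are made-up names, and the label/children shape is only assumed to mirror what stanza's tree exposes. Plain strings stand in for spaCy Tokens, and the sketch assumes the tree's leaves and the Doc's tokens line up one-to-one, which may not hold in every case.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional, Sequence


@dataclass
class TokenTree:
    """Hypothetical constituency node; a leaf carries a token instead of a stanza Word."""

    label: str
    children: List["TokenTree"] = field(default_factory=list)
    token: Optional[Any] = None  # a spaCy Token in real use; any object in this sketch

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> Iterator["TokenTree"]:
        """Yield the leaf nodes from left to right."""
        if self.is_leaf():
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


def parse_brackets(text: str) -> TokenTree:
    """Build a TokenTree from a bracketed string such as "(S (NP (PRP I)) ...)"."""
    symbols = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> TokenTree:
        nonlocal pos
        if symbols[pos] == "(":
            pos += 1  # consume "("
            node = TokenTree(label=symbols[pos])
            pos += 1
            while symbols[pos] != ")":
                node.children.append(read())
            pos += 1  # consume ")"
            return node
        node = TokenTree(label=symbols[pos])  # a bare word is a leaf
        pos += 1
        return node

    tree = read()
    if pos != len(symbols):
        raise ValueError("trailing input after the tree")
    return tree


def attach_tokens(tree: TokenTree, tokens: Sequence[Any]) -> TokenTree:
    """Pair leaves with tokens in order; assumes a one-to-one alignment."""
    leaves = list(tree.leaves())
    if len(leaves) != len(tokens):
        raise ValueError(f"{len(leaves)} leaves but {len(tokens)} tokens")
    for leaf, token in zip(leaves, tokens):
        leaf.token = token
    return tree


tree = parse_brackets("(ROOT (S (NP (PRP I)) (VP (VBP like) (NP (NNS trees)))))")
attach_tokens(tree, ["I", "like", "trees"])  # strings stand in for spaCy Tokens
terminals = [leaf.token for leaf in tree.leaves()]
```

In a real integration the tokens passed to attach_tokens would come from the sentence's span in the Doc, so walking the tree would hand back objects that still carry all the usual spaCy attributes and custom extensions.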