explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Support for spaCy 3 #61

Closed · Nithin-Holla closed this 3 years ago

Nithin-Holla commented 3 years ago

The latest release works with spaCy 2.3. Is there a release planned soon that supports spaCy 3?

adrianeboyd commented 3 years ago

We are working on it, but don't have a planned release date at this point.

If you'd like, you can try out PR #58 and leave a comment if you run into any bugs. I expect the final version to be very similar, but there may be some minor changes to the API before it's merged.

buhrmann commented 3 years ago

Just to confirm that I've tested the PR and it seems to work fine so far. The only issue is that it's very slow, at least on CPU. I haven't measured it in a direct comparison with previous versions, but it feels like it's orders of magnitude slower than spaCy v2 with the previous generation of Stanford models (approx. 10 s/text with Stanza vs. 100 texts/s with spaCy on my machine, for short-ish news articles). I'm not sure why; probably the new Stanza models are just optimized for GPU...
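For anyone who wants to put numbers on such a comparison, a rough throughput check might look like the following (a hypothetical helper, not part of either library; nlp and texts stand for whichever pipeline and corpus you're testing):

import time

def docs_per_second(nlp, texts):
    # Time one full pass over the texts and report docs/second.
    start = time.perf_counter()
    for _ in nlp.pipe(texts):
        pass
    return len(texts) / (time.perf_counter() - start)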

In any case, an easy gain would be to use the new batch processing in Stanza: https://github.com/stanfordnlp/stanza/pull/611/files. I may have a go at refactoring spacy-stanza after PR #58 so that we pipe batches of documents through Stanza first, and only then convert them to spaCy docs, instead of doing it one by one.
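For reference, the batch processing added in that stanza PR works by passing a list of stanza.Document objects to the pipeline in a single call. A minimal sketch (the model name and texts are placeholders):

import stanza

snlp = stanza.Pipeline("en")
texts = ["First news article ...", "Second news article ..."]

# Wrap each raw text in an empty stanza.Document, then process the
# whole batch in one pipeline call instead of one text at a time.
in_docs = [stanza.Document([], text=t) for t in texts]
out_docs = snlp(in_docs)  # a list of fully processed Documents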

adrianeboyd commented 3 years ago

Yes, we've been waiting for better batching options, but we'll have to wait until it lands in a stanza release.

I don't think anything related to changes in spacy v3 would affect the speed much, so it's probably related to stanza itself. The Doc conversion here is hardly changed from v0.2.x.

adrianeboyd commented 3 years ago

v1.0.0 now released!
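For anyone arriving at this issue later, usage with the v1.0.0 release and spaCy 3 looks roughly like this (following the project README):

import stanza
import spacy_stanza

# Download the stanza model once, then load it as a spaCy pipeline.
stanza.download("en")
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("Barack Obama was born in Hawaii.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)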

buhrmann commented 3 years ago

Great! We'd been running ahead in our fork and had already merged an earlier version of your PR, haha. But now we can sync with master more safely.

Unless you're going to work on this yourself in the next couple of weeks, I may have a go at a little refactor so that everything is ready for using Stanza's batch processing mode. I think it would simply be a question of moving most of the current code in the __call__ method into a separate stanzadoc_to_spacydoc() function (or similar). That way, __call__ would create a single stanza doc and convert it with this new function, while pipe would create batches of stanza docs and yield the converted spaCy docs from them. It could be set up so that activating Stanza's batch mode is just a one-line change, from

processed_batch = [snlp(txt) for txt in txt_batch]

to

processed_batch = snlp([stanza.Document([], text=txt) for txt in txt_batch])
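Put together, that refactor might look something like this rough sketch (stanzadoc_to_spacydoc is the hypothetical helper named above, self.snlp stands for the wrapped stanza pipeline, and the batching details are illustrative, not spacy-stanza's actual implementation):

from itertools import islice

class StanzaTokenizer:  # simplified stand-in for the real class
    def __init__(self, snlp, vocab):
        self.snlp = snlp  # the underlying stanza.Pipeline
        self.vocab = vocab

    def stanzadoc_to_spacydoc(self, snlp_doc, text):
        # Hypothetical helper: the conversion logic currently in
        # __call__ (tokens, tags, lemmas, heads, deps) moves here.
        ...

    def __call__(self, text):
        # Single-document path: run stanza, then convert.
        return self.stanzadoc_to_spacydoc(self.snlp(text), text)

    def pipe(self, texts, batch_size=32):
        # Batched path: run stanza over a whole batch first, then
        # yield the converted spaCy docs in order.
        texts = iter(texts)
        while txt_batch := list(islice(texts, batch_size)):
            processed_batch = [self.snlp(txt) for txt in txt_batch]
            # Stanza's batch mode would replace the line above with
            # the single batched call shown earlier.
            for snlp_doc, txt in zip(processed_batch, txt_batch):
                yield self.stanzadoc_to_spacydoc(snlp_doc, txt)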

In any case, let me know if you want me to have a go or not...

adrianeboyd commented 3 years ago

Sure, I think the changes are pretty minimal. The main thing is the pipe implementation like you described (it should look similar to TrainablePipe.pipe, which uses minibatch).
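For context, spacy.util.minibatch is the chunking utility referenced here; it turns any iterable into fixed-size lists:

from spacy.util import minibatch

texts = ["one", "two", "three", "four", "five"]
for batch in minibatch(texts, size=2):
    print(batch)
# ['one', 'two']
# ['three', 'four']
# ['five']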

Language.pipe doesn't use a pipe method for the tokenizer; it only uses nlp.make_doc, which in theory is allowed to differ from tokenizer.__call__, although I don't know that it ever does in practice. But it's something that could be considered so that you can use nlp.pipe rather than nlp.tokenizer.pipe, since nlp.pipe is much easier to use if you have additional components in your pipeline that implement pipe. It's possible we'll want to stick with the consistent use of nlp.make_doc in nlp.pipe, though; we'll just have to discuss it.
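To illustrate the distinction with a blank pipeline (by default make_doc simply calls the tokenizer; a Language subclass overriding make_doc would be the "in theory" case above):

import spacy

nlp = spacy.blank("en")

# Equivalent with the default Language implementation.
doc_a = nlp.make_doc("Stanza inside spaCy")
doc_b = nlp.tokenizer("Stanza inside spaCy")
assert [t.text for t in doc_a] == [t.text for t in doc_b]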