The current latest release works with spaCy 2.3. Is there a release planned soon that supports spaCy 3?
We are working on it, but don't have a planned release date at this point.
If you'd like you can try out PR #58 and leave a comment if you run into any bugs. I expect the final version to be very similar, but there may be some minor changes to the API before it's merged.
Just to confirm that I've tested the PR and it seems to work fine so far. The only thing is that it's very slow, at least on CPU. I haven't measured it in a direct comparison with previous versions, but I have the feeling it's almost another order of magnitude slower than spaCy v2 with the previous generation of Stanford models (approx. 10 s/text with Stanza vs. 100 texts/s with spaCy on my computer for shortish news articles). Not sure why; probably just the new Stanza models being optimized for GPU...
In any case, an easy gain would be to use the new batch processing in Stanza: https://github.com/stanfordnlp/stanza/pull/611/files. I may have a go at refactoring spacy-stanza after PR #58 is merged, so we can pipe batches of documents through Stanza first and only then convert them to spaCy docs, instead of doing it one by one.
Yes, we've been waiting for better batching options, but we'll have to wait until it lands in a stanza release.
I don't think anything related to the changes in spaCy v3 would affect the speed much, so it's probably related to Stanza itself. The `Doc` conversion here is hardly changed from v0.2.x.
v1.0.0 now released!
Great! We've been ahead in our fork and had already merged an earlier version of your PR, hahaha. But now we can sync more safely with master.
Unless you're going to work on this yourself in the next couple of weeks, I may have a go at a little refactor so that everything is ready for using Stanza's batch processing mode. I think it'd simply be a question of moving most of the current code in the `__call__` method into a separate `stanzadoc_to_spacydoc()` function (or similar). That way, `__call__` would create a single Stanza doc and convert it using this new function, and `pipe` would create batches of Stanza docs and yield equally converted spaCy docs from them. It could be set up such that activating Stanza's batch mode would just be a matter of changing a single line, from
```python
processed_batch = [snlp(txt) for txt in txt_batch]
```
to
```python
processed_batch = stanza.pipe(txt_batch)
```
In any case, let me know if you want me to have a go or not...
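Roughly what I have in mind (just a sketch: `stanzadoc_to_spacydoc`, the class name, and the `self.snlp`/`self.vocab` attributes are placeholders for whatever the real code uses):

```python
from spacy.util import minibatch


def stanzadoc_to_spacydoc(vocab, snlp_doc, text):
    """Convert a processed Stanza document into a spaCy Doc.
    (Essentially the body of the current __call__.)"""
    ...


class StanzaTokenizer:  # illustrative name for the existing tokenizer class
    def __call__(self, text):
        # Single-document path: run the Stanza pipeline, then convert.
        snlp_doc = self.snlp(text)
        return stanzadoc_to_spacydoc(self.vocab, snlp_doc, text)

    def pipe(self, texts, batch_size=32):
        # Batched path: run a whole batch through Stanza first, then
        # convert the results to spaCy docs one by one.
        for txt_batch in minibatch(texts, size=batch_size):
            txt_batch = list(txt_batch)
            # The single line to swap once Stanza's batch mode is released:
            processed_batch = [self.snlp(txt) for txt in txt_batch]
            for txt, snlp_doc in zip(txt_batch, processed_batch):
                yield stanzadoc_to_spacydoc(self.vocab, snlp_doc, txt)
```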
Sure, I think the changes are pretty minimal. The main thing is the `pipe` implementation like you described (it should look similar to `TrainablePipe.pipe` using `minibatch`).

`Language.pipe` doesn't use a `pipe` method for the tokenizer (it only uses `nlp.make_doc`, which is allowed to differ from `tokenizer.__call__` in theory, although I don't know that it ever does in practice), but it's something that could be considered so you can use `nlp.pipe` rather than `nlp.tokenizer.pipe`, since `nlp.pipe` is much easier to use if you have additional components in your pipeline. It's possible we'll want to stick with the consistent use of `nlp.make_doc` in `nlp.pipe`, though; we'll just have to discuss it.
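To make the distinction concrete, a quick usage sketch (assuming the models are downloaded and the tokenizer gets a `pipe` method as described; `load_pipeline` is the v1.0.0 loading function):

```python
import spacy_stanza

nlp = spacy_stanza.load_pipeline("en")
texts = ["First news article ...", "Second news article ..."]

# Batched tokenization would have to be requested explicitly:
docs = list(nlp.tokenizer.pipe(texts, batch_size=32))

# Language.pipe currently tokenizes via nlp.make_doc, i.e. roughly:
docs = [nlp.make_doc(text) for text in texts]
# ...so it wouldn't pick up tokenizer.pipe automatically, but it does
# run any additional pipeline components for you:
docs = list(nlp.pipe(texts))
```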