Closed GuyAglionby closed 11 months ago
Hi,
Thanks for reaching out!
Unfortunately there isn't anything in the docs currently. We've had some docs on batch processing in progress for a while, but they're not complete yet.
Are you sending a large number of documents/large documents to the kazu pipeline? If you're sending a small number of documents, it should only take a few seconds.
If you're sending a large number of documents, one option is to 'batch' your documents to kazu in a for loop, and wrap the loop with tqdm, which will give you a progress bar showing how many documents/batches you've run through the pipeline.
You could do the batching with e.g. the chunked recipe from more_itertools.
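Roughly what I have in mind is something like the sketch below. It's only an illustration, not an official kazu recipe: `process_batch` is a hypothetical stand-in for however you call the kazu pipeline on a list of documents, and `run_in_batches` and the `batch_size` default are names/values I've made up for the example. The two `except ImportError` fallbacks are only there so the snippet still runs if tqdm or more_itertools isn't installed.

```python
import math
from itertools import islice

try:
    from more_itertools import chunked
except ImportError:
    # Fallback that behaves the same for this use case: lists of up to n items.
    def chunked(iterable, n):
        it = iter(iterable)
        while batch := list(islice(it, n)):
            yield batch

try:
    from tqdm import tqdm
except ImportError:
    # Fallback: no progress bar, just iterate.
    def tqdm(iterable, **kwargs):
        return iterable


def run_in_batches(process_batch, docs, batch_size=100):
    """Call process_batch on successive chunks of docs, with a progress bar over batches."""
    # chunked() returns a generator with no length, so give tqdm the total explicitly.
    n_batches = math.ceil(len(docs) / batch_size)
    for batch in tqdm(chunked(docs, batch_size), total=n_batches, desc="batches"):
        # Hypothetical: replace process_batch with your own call into the kazu pipeline.
        process_batch(batch)
```

Once a few batches have gone through, tqdm also shows an estimated time remaining, which should help with knowing how long to wait. The batch size is a trade-off: smaller batches give more frequent progress updates, so it's worth trying a few values.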
Let me know if any of that is unclear.
Is that good enough for what you're after, or are you running kazu in a way that each step is taking a long time for even a small number of documents?
Thanks for the quick response! I'm sending ~50k docs, but they're not very big. I've batched them into reasonably small chunks before feeding them to the pipeline as you suggest, and that's working nicely.
Thanks again
Glad I could help!
I was curious (possibly nosy) about how kazu might be being used and had an explore of your GitHub profile and website. Looks like you're doing interesting stuff!
Would be happy/interested to know how well kazu suits your needs/what shortcomings it currently has for your usage, to see if we can fix/improve them. So feel free to open more issues, or even reach out directly if it's less an 'issue'/more ambiguous - you can see my work email in the kazu git history, e.g. looking at this git commit: https://github.com/AstraZeneca/KAZU/commit/ccb99ddbcf63633be31f8c977b658aff25ef38c8.patch
Thanks for this useful library! I wonder if there's a recommendation for how to track progress of the pipeline, and also check progress through each stage? It'd be useful to know how long I should expect to wait for it to complete. I'm using the default pipeline.
Thanks in advance for any pointers