As far as data consistency goes, would it be possible to enforce the (tokenization) checks on the DTOs during, say, `pipeline.build()`? That way you're guaranteed not to hit any such errors in `pipeline.run()`. Essentially, split the execution of the pipeline into two phases: the first handles all the checks, and the second actually runs the pipeline (see the sketch below).

As for caching, I think the mechanism will be useful regardless.
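A rough sketch of that two-phase split (the class and method names here are hypothetical, not existing evalem API):

```python
from typing import Callable, Iterable, List


class TwoPhasePipeline:
    """Hypothetical two-phase pipeline: build() front-loads all checks,
    run() is pure execution on already-validated inputs."""

    def __init__(self) -> None:
        self._checks: List[Callable[[str], None]] = []
        self._texts: List[str] = []
        self._built = False

    def add_check(self, check: Callable[[str], None]) -> "TwoPhasePipeline":
        self._checks.append(check)
        return self

    def build(self, texts: Iterable[str]) -> "TwoPhasePipeline":
        # Phase 1: validate every input up front (e.g. tokenization
        # checks on the DTOs), so run() cannot fail on malformed text.
        self._texts = list(texts)
        for text in self._texts:
            for check in self._checks:
                check(text)  # each check raises on invalid input
        self._built = True
        return self

    def run(self) -> List[str]:
        # Phase 2: execution only; inputs were validated in build().
        if not self._built:
            raise RuntimeError("Call build() before run().")
        return [self._forward(text) for text in self._texts]

    def _forward(self, text: str) -> str:
        # Placeholder for the actual model forward pass.
        return text
```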
I like the `build(...)` mechanism. Will add it to my to-do list. Right now, what we're basically doing is passing texts as they are to `transformers.pipeline(...)`, which implicitly handles all the tokenization, forward pass, etc.
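For reference, that current flow is roughly equivalent to the following (the task and model name below are just examples):

```python
from transformers import pipeline

# transformers.pipeline(...) internally handles tokenization, the
# forward pass, and decoding; texts are passed straight through.
qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",  # example checkpoint
)
result = qa(
    question="What does the pipeline handle implicitly?",
    context="The transformers pipeline handles tokenization and the forward pass.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```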
What

Currently, `evalem.pipelines.SimpleEvaluationPipeline` is stateless: forward passes (including inference and evaluation results) aren't cached within the pipeline object. This is fine for inference + evaluation on a small sample size. However, for a bigger one, say the full SQuAD v2 train split of ~86k samples, re-running the inference to get predictions is time-consuming whenever we want to switch the Evaluator object.
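To illustrate the cost (the constructor and call signatures here are assumptions for illustration, not verified against evalem):

```python
from evalem.pipelines import SimpleEvaluationPipeline  # path as referenced above

# `model`, `exact_match`, `f1`, `inputs`, and `references` are placeholders.
pipe_em = SimpleEvaluationPipeline(model=model, evaluators=[exact_match])
results_em = pipe_em(inputs, references)  # full forward pass over ~86k samples

# Switching the evaluator means a brand-new pipeline and, because nothing
# is cached, the exact same forward pass all over again:
pipe_f1 = SimpleEvaluationPipeline(model=model, evaluators=[f1])
results_f1 = pipe_f1(inputs, references)
```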
Why

To speed up evaluation without re-running the forward pass on a huge dataset. Caching would also help with debugging on such large samples, since it's a bummer to catch runtime errors (say, tokenization errors on weird texts) at a late stage of the pipeline.
How

Maybe we can have a new `CachedSimpleEvaluationPipeline` (or something like that) that's able to load predictions from external files (text, JSON, etc.).

cc: @muthukumaranR
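A minimal sketch of what that could look like, assuming predictions are serialized to JSON; only the class name comes from this proposal, while the method names and on-disk format are assumptions:

```python
import json
from pathlib import Path
from typing import Callable, List


class CachedSimpleEvaluationPipeline:
    """Sketch of a cached variant: run the forward pass once, persist
    the predictions, and let evaluators be swapped cheaply."""

    def __init__(self, model: Callable[[List[str]], List[str]], cache_path: str) -> None:
        self.model = model
        self.cache_path = Path(cache_path)

    def predict(self, texts: List[str]) -> List[str]:
        # Load predictions from the external file if it exists;
        # otherwise run inference once and write the cache.
        if self.cache_path.exists():
            return json.loads(self.cache_path.read_text())
        predictions = self.model(texts)
        self.cache_path.write_text(json.dumps(predictions))
        return predictions

    def evaluate(self, texts: List[str], references: List[str], evaluator) -> dict:
        # Switching `evaluator` only re-scores the cached predictions.
        predictions = self.predict(texts)
        return evaluator(predictions, references)
```

With that shape, swapping evaluators (say, exact-match for F1) re-reads the cached predictions instead of re-running the forward pass over all 86k samples.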