Open plaguss opened 3 months ago
Comparing cache-per-step (bb28b0b) with develop (a178109)

⚡ 1 improvement

| | Benchmark | develop | cache-per-step | Change |
|---|---|---|---|---|
| ⚡ | test_cache_time | 394.7 ms | 224.3 ms | +75.94% |

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-766/
Description
This PR implements cache at step level.
Previously, we computed a single signature for the whole pipeline, and when that signature changed, we recomputed everything. Now the signature is computed per step, and when it changes we only recompute the steps whose own signature, or that of a preceding step, has changed. So for a pipeline `A -> B -> C`, if step `B` changes, we will recompute only `B` and `C`, starting from the data we already had from `A`.

New cases we control with this change:
- If the pipeline is interrupted (for example with `Ctrl+c`), we can restart from where we left off.
- If we have a pipeline `a >> b >> c >> d` and we change a step (say `c`), we will only recompute `c` and `d`.
- Cache control at the `step` level. There is a `use_cache` argument at the `_Step` level; when it is set to `False`, the cache won't be used from that step onwards, even if the pipeline remains the same.

Note: This affects how we read the previously serialized parquet files. If any step's `use_cache` is set to `False` in a pipeline that hasn't changed, we won't read the previously serialized content.
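
To make the invalidation rule concrete, here is a minimal sketch of per-step signatures for a linear pipeline. It is not the code in this PR: `StepSpec`, `step_signature`, `pipeline_signatures` and `steps_to_recompute` are hypothetical names, and hashing a JSON dump of the step parameters is only an assumption about how a signature could be built. Only the `use_cache` flag name comes from the description above.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StepSpec:
    """Hypothetical stand-in for a pipeline step (not distilabel's `_Step`)."""

    name: str
    params: Dict[str, Any] = field(default_factory=dict)
    use_cache: bool = True


def step_signature(step: StepSpec, upstream_signature: str) -> str:
    """Hash the step's own config together with its predecessor's signature,
    so a change in an earlier step also changes every later signature."""
    payload = json.dumps(
        {"name": step.name, "params": step.params, "upstream": upstream_signature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def pipeline_signatures(steps: List[StepSpec]) -> Dict[str, str]:
    """Compute the chained signature of every step, in pipeline order."""
    result: Dict[str, str] = {}
    upstream = ""
    for step in steps:
        upstream = step_signature(step, upstream)
        result[step.name] = upstream
    return result


def steps_to_recompute(steps: List[StepSpec], cached: Dict[str, str]) -> List[str]:
    """Return the steps that must run again, given the signatures stored by a
    previous run. Once one step is stale, every later step is stale too."""
    current = pipeline_signatures(steps)
    stale: List[str] = []
    invalidated = False
    for step in steps:
        # A step is stale if its signature changed or it opted out of caching.
        if not step.use_cache or cached.get(step.name) != current[step.name]:
            invalidated = True
        if invalidated:
            stale.append(step.name)
    return stale


# Previous run: A -> B -> C, all cached.
previous = [StepSpec("A"), StepSpec("B", {"temperature": 0.1}), StepSpec("C")]
cached = pipeline_signatures(previous)

# Changing B invalidates B and C, while A is reused.
changed = [StepSpec("A"), StepSpec("B", {"temperature": 0.9}), StepSpec("C")]
print(steps_to_recompute(changed, cached))  # ['B', 'C']

# An unchanged pipeline with use_cache=False on B behaves the same way.
no_cache = [StepSpec("A"), StepSpec("B", {"temperature": 0.1}, use_cache=False), StepSpec("C")]
print(steps_to_recompute(no_cache, cached))  # ['B', 'C']
```

Chaining each signature with the upstream one means a downstream step never has to inspect its ancestors directly. The explicit `invalidated` flag is still needed for the `use_cache=False` case, where the signatures themselves do not change.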
Closes #651