Fix recovery in datasets

MaxiBoether commented 3 months ago

Before, we had recovery logic based on reply indices. However, while working on storage, I realized those responses come non deterministically from multiple threads. Hence, we cannot rely on the ordering. We need to keep track of the sample IDs we already yielded. I changed the logic to just keep a list which is cheap to append to, and only convert to a set / hash table as soon as we failed once and we actually need to do many in checks.

github-actions[bot] commented 3 months ago

^{( % to main)} ^{( % to main)}

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 55.55556% with 20 lines in your changes missing coverage. Please review.

Project coverage is 82.37%. Comparing base (cb0be37) to head (6017c37).

Files	Patch %	Lines
...n/evaluator/internal/dataset/evaluation_dataset.py	50.00%	14 Missing :warning:
.../trainer_server/internal/dataset/online_dataset.py	64.70%	6 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #527 +/- ## ========================================== - Coverage 82.49% 82.37% -0.13% ========================================== Files 215 215 Lines 10054 10080 +26 ========================================== + Hits 8294 8303 +9 - Misses 1760 1777 +17 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

XianzheMa commented 3 months ago

Can you piggyback deleting this log https://github.com/eth-easl/modyn/blob/55fef7c88b66a61d375d1555db46682157f3bbf5/modyn/supervisor/internal/pipeline_executor/pipeline_executor.py#L480 in this PR? This log is basically wrong because len(s.triggers) is not the number of triggers we this time are processing.

eth-easl / modyn

Fix recovery in datasets #527

Codecov Report