More robust gRPC connections, using `tenacity` where possible

MaxiBoether commented 2 weeks ago

Sometimes, we face outages/random disconnections during training. This fixes it in places where I encountered it last night. I tried to integrate tenacity as suggested by @robinholzi, but it's not always possible since the retry logic involves keeping track of already done work, which I don't want to put into class state

Part 6/n of porting over SIGMOD changes.

robinholzi commented 2 weeks ago

@MaxiBoether Not fully what the problem with partial result states is, but have you considered using retry contexts?

MaxiBoether commented 2 weeks ago

@MaxiBoether Not fully what the problem with partial result states is, but have you considered using retry contexts?

Ah, nope. I guess I can refactor the code to retry contexts then

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 71.59091% with 25 lines in your changes missing coverage. Please review.

Project coverage is 82.50%. Comparing base (75868bb) to head (ac1d863).

:exclamation: Current head ac1d863 differs from pull request most recent head 19cde94

Please upload reports for the commit 19cde94 to get more accurate results.

Files	Patch %	Lines
...n/evaluator/internal/dataset/evaluation_dataset.py	61.11%	14 Missing :warning:
modyn/supervisor/internal/grpc_handler.py	50.00%	5 Missing :warning:
.../internal/pipeline_executor/evaluation_executor.py	54.54%	5 Missing :warning:
.../trainer_server/internal/dataset/online_dataset.py	95.00%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #521 +/- ## ========================================== - Coverage 82.55% 82.50% -0.05% ========================================== Files 214 214 Lines 9975 10039 +64 ========================================== + Hits 8235 8283 +48 - Misses 1740 1756 +16 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

MaxiBoether commented 2 weeks ago

@MaxiBoether Not fully what the problem with partial result states is, but have you considered using retry contexts?

Thanks for suggesting that. I adapted the code. However, I needed to manually catch and reraise exceptions sometimes because I find their callback system to be quite ugly in object-oriented contexts (we need to write a static function that recovers self out of args[0] etc). I think this actually looks better than using callbacks, and it allows for easier logging with our logging infra. The retry logic itself is still more hidden now :)

github-actions[bot] commented 2 weeks ago

^{( % to main)} ^{( % to main)}

eth-easl / modyn

More robust gRPC connections, using `tenacity` where possible #521

Codecov Report