Aiven-Open / pghoard

PostgreSQL® backup and restore service
http://aiven-open.github.io/pghoard/
Apache License 2.0
1.31k stars 96 forks source link

Add more retries in case of download failure #557

Closed rikonen closed 1 year ago

rikonen commented 1 year ago

In some rare cases 3 retries isn't enough but 6 is. This should really be made configurable but creating this pr as a quick fix in case nobody else feels like creating a better one.

rikonen commented 1 year ago

Those tests look somewhat unrelated since the failing tests don't even import the code that was changed. But looks like the failures are consistent and main hasn't failed with the same, though last test run for main was a long time ago so could be something has broken since.

rdunklau commented 1 year ago

It looks like the reasons the tests are failing are related to https://github.com/aiven/rohmu/pull/69.

A minor detail but it would be better to write the code differently: either use an else clause in the loop instead of testing for the "range - 1" attempt, or make it obvious. Eg, either:

max_retry = 6
for i in range(max_retry):
    ....
else:
    self.log.error("Download stalled despite retries, aborting")
    self.errors = 1

Or:

max_retry = 6
for i in range(max_retry):
   if i == max_retry -1:
       ...

I'm in favor of the first one.

alanfranz commented 1 year ago

I'll revert the change to rohmu, it was creating some issues in some dev environments as well.

rikonen commented 1 year ago

Added a constant for max retries to get rid of the magic numbers spread around the code.

ryansammonaiven commented 1 year ago

@rikonen I encountered a similar issue in this escalation, but I also had to increase the allowed stalled time to get it to complete.

alexole commented 1 year ago

rohmu with reverted change needs to be released in PyPi so that tests are passing, I guess @alanfranz is working on it.

codecov[bot] commented 1 year ago

Codecov Report

Merging #557 (1693152) into main (3be6acf) will increase coverage by 0.04%. The diff coverage is 100.00%.

Additional details and impacted files [![Impacted file tree graph](https://codecov.io/gh/aiven/pghoard/pull/557/graphs/tree.svg?width=650&height=150&src=pr&token=nLr7M7hvCx&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiven)](https://codecov.io/gh/aiven/pghoard/pull/557?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiven) ```diff @@ Coverage Diff @@ ## main #557 +/- ## ========================================== + Coverage 91.28% 91.33% +0.04% ========================================== Files 32 32 Lines 4671 4672 +1 ========================================== + Hits 4264 4267 +3 + Misses 407 405 -2 ``` | [Impacted Files](https://codecov.io/gh/aiven/pghoard/pull/557?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiven) | Coverage Δ | | |---|---|---| | [pghoard/restore.py](https://codecov.io/gh/aiven/pghoard/pull/557?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiven#diff-cGdob2FyZC9yZXN0b3JlLnB5) | `89.61% <100.00%> (-0.49%)` | :arrow_down: | | [pghoard/webserver.py](https://codecov.io/gh/aiven/pghoard/pull/557?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiven#diff-cGdob2FyZC93ZWJzZXJ2ZXIucHk=) | `89.19% <0.00%> (+0.22%)` | :arrow_up: | | [pghoard/pghoard.py](https://codecov.io/gh/aiven/pghoard/pull/557?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aiven#diff-cGdob2FyZC9wZ2hvYXJkLnB5) | `85.03% <0.00%> (+0.65%)` | :arrow_up: |