We previously chose explicitly not to retry on error for NN uploads. But errors do happen, and it's a shame to lose a full training run to a temporary upload error.
If we do retry, we should use settings specific to that large upload:
a longer global timeout
don't retry 10 times; 3 is probably enough
fewer retries, so longer waits between them (maybe 10s after the first failure, then 30s?)
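
As a rough illustration of the settings above, here is a minimal Python sketch. All names in it (`upload_with_retry`, `upload`, `timeout_s`, `delays`) are hypothetical and not taken from the existing code, and the 600s timeout is only a placeholder since no value has been proposed. It reads "3" as three attempts in total, which matches the two suggested waits of 10s and 30s.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def upload_with_retry(
    upload: Callable[[float], T],
    timeout_s: float = 600.0,                # placeholder for the "longer global timeout"
    delays: Sequence[float] = (10.0, 30.0),  # wait after 1st failure, then after 2nd
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call upload(timeout_s), retrying after each delay in `delays`.

    Total attempts = len(delays) + 1, so 3 with the defaults.
    """
    for delay in delays:
        try:
            return upload(timeout_s)
        except Exception:
            # Assumption: every error is treated as transient. A real
            # implementation would probably only retry network/timeout errors.
            sleep(delay)
    # Final attempt: let the error propagate so the caller sees the failure.
    return upload(timeout_s)
```

The `sleep` parameter is only there so the waits can be replaced in tests. Passing the timeout into the `upload` callable keeps the sketch independent of whichever HTTP client performs the upload.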