Closed rmt2021 closed 11 months ago
Yes, this is a good point. Thanks for pointing this out @rmt2021.
Conceptually, I agree. I think there is a spectrum of how complex the error handing should be.
One of the trade-offs I try to consider is complexity of code vs. added value due to this complexity. For a data analysis project like this, where the vast majority of the logic is performed off of a distributed queue, I believe the impact of this bug is that a unit of work will remain leased longer than necessary (e.g. perhaps 1 minute instead of 1 second). Retries of the distributed queue message should (hopefully) yield a successful result in the end except slower than what is ideal.
I'm happy to accept a PR resolving this concern but (unless I am missing something -- please correct me if I am) there isn't a totally broken scenario caused by it. It just affects throughput.
Thanks again for this issue (and your other high quality report)!
Thank you for your reply!
Yeah, I like the point "there is a spectrum of how complex the error handing should be".
I agree that the consequence of this case may not be that severe, so I just add the transient error possibility in the BaseLeaseResult. However, we also found other Azure APIs may have similar problems, as mentioned in #77. Could you kindly take a look?
I've further improved the transient error case by a) updating the etag each time the lease is acquired via a metadata update and b) checking for the known etag during release. https://github.com/NuGet/Insights/blob/5d0b5ed62a71b10d93d77f5135b4876a1b1bf4dc/src/Logic/Leasing/StorageLeaseService.cs#L169-L173
I think this is "good enough" for now.
Version
18d694cc1f1bd72826e7ccbfead499beb4613116
Bug description
We found that transient network errors (e.g., network timeout or package drop) may happen when releasing the lease of a blob via
ReleaseAsync
SDK API, and there is no handling for the transient errors.More precisely, if the effect has actually taken place in the server but the response of
leaseClient.ReleaseAsync()
never returns to the client on time due to transient network errors, the retry logic in Azure Storage SDK will send another release request again. Since the first release operation has released the lease successfully in the remote, there will be a Conflict error when the second request arrives.The default value for
shouldthrow
is false inTryReleaseAsync
, thus in the catch block, it will directly return false without any other handling. However, the operation indeed succeeded so it should return true instead of false. The following lines of code show the detail:How to reproduce
As shown above, when the first release request sent by
leaseClient.ReleaseAsync()
inTryReleaseAsync
succeeds in the remote but transient errors happen, the second retry request will lead to a Conflict error, which is handled wrongly.Discussion
We should distinguish between the errors led by contention or transient error. For transient ones, we can ignore the exception and return true. We believe a similar bug also happens when
shouldthrow
is true becauseInvalidOperationException
should not be thrown when transient errors happen.