NuGet / Insights

Gather insights about public NuGet.org package data
Apache License 2.0
24 stars 7 forks source link

Transient network error leads to the wrong return value when releasing the Blob lease #71

Closed rmt2021 closed 11 months ago

rmt2021 commented 2 years ago

Version

18d694cc1f1bd72826e7ccbfead499beb4613116

Bug description

We found that transient network errors (e.g., network timeout or package drop) may happen when releasing the lease of a blob via ReleaseAsync SDK API, and there is no handling for the transient errors.

More precisely, if the effect has actually taken place in the server but the response of leaseClient.ReleaseAsync() never returns to the client on time due to transient network errors, the retry logic in Azure Storage SDK will send another release request again. Since the first release operation has released the lease successfully in the remote, there will be a Conflict error when the second request arrives.

The default value for shouldthrow is false in TryReleaseAsync, thus in the catch block, it will directly return false without any other handling. However, the operation indeed succeeded so it should return true instead of false. The following lines of code show the detail:

...
    try
    {
        await leaseClient.ReleaseAsync();
        return true;
    }
    catch (RequestFailedException ex) when (ex.Status == (int)HttpStatusCode.Conflict)
    {
        if (shouldThrow)
        {
            throw new InvalidOperationException(StorageLeaseResult.AcquiredBySomeoneElse, ex);
        }
        else
        {
            return false;
...

How to reproduce

As shown above, when the first release request sent by leaseClient.ReleaseAsync() in TryReleaseAsync succeeds in the remote but transient errors happen, the second retry request will lead to a Conflict error, which is handled wrongly.

Discussion

We should distinguish between the errors led by contention or transient error. For transient ones, we can ignore the exception and return true. We believe a similar bug also happens when shouldthrow is true because InvalidOperationException should not be thrown when transient errors happen.

joelverhagen commented 2 years ago

Yes, this is a good point. Thanks for pointing this out @rmt2021.

Conceptually, I agree. I think there is a spectrum of how complex the error handing should be.

One of the trade-offs I try to consider is complexity of code vs. added value due to this complexity. For a data analysis project like this, where the vast majority of the logic is performed off of a distributed queue, I believe the impact of this bug is that a unit of work will remain leased longer than necessary (e.g. perhaps 1 minute instead of 1 second). Retries of the distributed queue message should (hopefully) yield a successful result in the end except slower than what is ideal.

I'm happy to accept a PR resolving this concern but (unless I am missing something -- please correct me if I am) there isn't a totally broken scenario caused by it. It just affects throughput.

Thanks again for this issue (and your other high quality report)!

rmt2021 commented 2 years ago

Thank you for your reply!

Yeah, I like the point "there is a spectrum of how complex the error handing should be".

I agree that the consequence of this case may not be that severe, so I just add the transient error possibility in the BaseLeaseResult. However, we also found other Azure APIs may have similar problems, as mentioned in #77. Could you kindly take a look?

joelverhagen commented 11 months ago

I've further improved the transient error case by a) updating the etag each time the lease is acquired via a metadata update and b) checking for the known etag during release. https://github.com/NuGet/Insights/blob/5d0b5ed62a71b10d93d77f5135b4876a1b1bf4dc/src/Logic/Leasing/StorageLeaseService.cs#L169-L173

I think this is "good enough" for now.