Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/
MIT License
1.63k stars 837 forks source link

CosmosDB `(*ContainerClient).ReadItem()` not thread safe (any longer) #23052

Closed calmh closed 4 months ago

calmh commented 4 months ago

When upgrading from azcosmos 0.3.6 to 1.0.1 we see race detector warnings when concurrently doing (*ContainerClient).ReadItem(). We do this in several places for performance/concurrency, and so far it has worked well. Now, however it gives us this:

WARNING: DATA RACE
Write at 0x00c0002e59f0 by goroutine 143:
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*clientRetryPolicy).resetPolicyCounters()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client_retry_policy.go:178 +0x70
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*clientRetryPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client_retry_policy.go:30 +0x68
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*sharedKeyCredPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/shared_key_credential.go:122 +0x466
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/runtime/policy_retry.go:121 +0x739
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*globalEndpointManagerPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_global_endpoint_manager_policy.go:32 +0x14b
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*headerPolicies).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_headers_policy.go:51 +0x7f8
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/runtime/policy_telemetry.go:70 +0x2a5
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*telemetryPolicy).Do()
      <autogenerated>:1 +0x5a
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/runtime/policy_include_response.go:19 +0x2e
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:177 +0x33
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/pipeline.go:76 +0x8f
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).executeAndEnsureSuccessResponse()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client.go:476 +0x113
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).sendGetRequest()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client.go:357 +0x124
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*ContainerClient).ReadItem()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_container.go:350 +0x351
...

Previous write at 0x00c0002e59f0 by goroutine 142:
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*clientRetryPolicy).resetPolicyCounters()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client_retry_policy.go:178 +0x70
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*clientRetryPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client_retry_policy.go:30 +0x68
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*sharedKeyCredPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/shared_key_credential.go:122 +0x466
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/runtime/policy_retry.go:121 +0x739
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*globalEndpointManagerPolicy).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_global_endpoint_manager_policy.go:32 +0x14b
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*headerPolicies).Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_headers_policy.go:51 +0x7f8
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/runtime/policy_telemetry.go:70 +0x2a5
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*telemetryPolicy).Do()
      <autogenerated>:1 +0x5a
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/runtime/policy_include_response.go:19 +0x2e
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:177 +0x33
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/request.go:107 +0x218
  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.11.1/internal/exported/pipeline.go:76 +0x8f
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).executeAndEnsureSuccessResponse()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client.go:476 +0x113
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).sendGetRequest()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_client.go:357 +0x124
  github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*ContainerClient).ReadItem()
      /home/runner/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v1.0.1/cosmos_container.go:350 +0x351
...

The actual race seems to be on a counter of sorts, I'm not sure if this is new and should be fixed, of if it's intended that it's not supported to make concurrent requests to ReadItem? The latter would be a bit of a bummer, we have a lot of requests coming in to the database layer and having to create a client for each of them seems inefficient.

github-actions[bot] commented 4 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @MehaKaushik @Pilchie @Wmengmsft.

Pilchie commented 4 months ago

@simorenoh @ealsur - any ideas?

ealsur commented 4 months ago

Interesting. This would indicate that the policy's instance is not tied to the request, but shared, which then means that the retry state:

https://github.com/Azure/azure-sdk-for-go/blob/c00d659e25b61d8cab9af614b2c598fe7b7b0734/sdk/data/azcosmos/cosmos_client_retry_policy.go#L178-L180

is wrong. And should be instead part of the request and extracted through req.OperationValue

ealsur commented 4 months ago

One alternative is to incorporate the retry counters into pipelineRequestOptions:

type retryContext struct {
        useWriteEndpoint       bool
    retryCount             int
    sessionRetryCount      int
    preferredLocationIndex int
}

type pipelineRequestOptions struct {
    ....
        retryContext retryContext 
}

And then use that to track across the client retry policy execution.

ealsur commented 4 months ago

@calmh We'll fix this. Working on a PR

ealsur commented 4 months ago

We'll release the new package on 6/17

calmh commented 4 months ago

Sorry to bump the issue again, just want to say I'm super impressed with the rapid handling and communication here. Top notch, peeps. 🙏

ealsur commented 4 months ago

@calmh Please confirm if your issue is resolved with 1.0.3

calmh commented 4 months ago

Yep, seems fine!