Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.36k stars 2k forks source link

[BUG] "Resource Not Found" errors #6163

Closed dkroehan closed 4 years ago

dkroehan commented 5 years ago

Describe the bug We have integration tests that run tests against a real CosmosDB. It basically tests that the queries are working as expected. In some runs we are getting "Resource Not Found" errors. This occurs both in our CI/CD pipeline and locally.

We have a DatabaseInitializer class, which sets up the Cosmos database, creates containers and create some stored procedures during the application startup.

In some cases the "Resource Not Found" error already occurs during the setup, but most of the time it occurs when running the queries or adding/cleaning up test data.

Exception or Stack Trace This is the stack trace from the database setup:

12:16:52.459 [reactor-http-epoll-8] ERROR .....DatabaseInitializer - Error when creating stored procedures
com.azure.data.cosmos.CosmosClientException: Message: {"Errors":["Resource Not Found"]}
ActivityId: 4a445bec-9e42-493c-8ebe-26b6cc646c62, Request URI: /apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174287216593763s, RequestStats: 
RequestStartTime: 2019-11-05T12:16:52.4270506Z, RequestEndTime: 2019-11-05T12:16:52.4270506Z,  Number of regions attempted:1
ResponseTime: 2019-11-05T12:16:52.4270506Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.21:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174287216593763s, LSN: 16068, GlobalCommittedLsn: 16068, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16068, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
ResponseTime: 2019-11-05T12:16:52.4270506Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.32:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132173994761905544s, LSN: 16068, GlobalCommittedLsn: 16068, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16068, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound
    at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:350)
    at com.azure.data.cosmos.internal.RxGatewayStoreModel.validateOrThrow(RxGatewayStoreModel.java:388)
    at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$null$5(RxGatewayStoreModel.java:326)

This is a stack trace when the database is setup successfully and the test is adding some test data before executing a query:

[INFO] Running MyIntegrationTest
12:17:11.339 [reactor-http-epoll-8] ERROR c.a.d.c.internal.ClientRetryPolicy - locationEndpoint is null because ClientRetryPolicy::onBeforeRequest(.) is not invoked, probably request creation failed due to invalid options, serialization setting, etc.
12:17:11.339 [reactor-http-epoll-8] ERROR c.a.d.c.i.RenameCollectionAwareClientRetryPolicy - onBeforeSendRequest is not invoked, encountered failure due to request being null
com.azure.data.cosmos.CosmosClientException: Message: {"Errors":["Resource Not Found"]}
ActivityId: 5f36d953-1915-4abc-832d-5fca0b7d1e79, Request URI: /apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132173994761905544s, RequestStats: 
RequestStartTime: 2019-11-05T12:17:11.3248995Z, RequestEndTime: 2019-11-05T12:17:11.3248995Z,  Number of regions attempted:1
ResponseTime: 2019-11-05T12:17:11.3248995Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.32:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132173994761905544s, LSN: 16076, GlobalCommittedLsn: 16076, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16076, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
ResponseTime: 2019-11-05T12:17:11.3248995Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.23:11300/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174270628133776s, LSN: 16076, GlobalCommittedLsn: 16076, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16076, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound
    at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:350)
    at com.azure.data.cosmos.internal.RxGatewayStoreModel.validateOrThrow(RxGatewayStoreModel.java:388)
    at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$null$5(RxGatewayStoreModel.java:326)

To Reproduce This is tricky, since it occurs kind of randomly.

Setup (please complete the following information): IDE : IntelliJ Java SDK: v2.4.4 and SDK v3.2.2

Additional context If this is not a bug, it would be really great to understand what is wrong with the setup.

ghost commented 5 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagarathapar @srinathnarayanan @southpolesteve @shurd

ghost commented 5 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagarathapar @srinathnarayanan @southpolesteve @shurd

kirankumarkolli commented 5 years ago

@dkroehan is the above error with the emulator or production cosmos account?

dkroehan commented 5 years ago

@kirankumarkolli The error is with a production cosmos account.

kirankumarkolli commented 5 years ago
12:16:52.459 [reactor-http-epoll-8] ERROR .....DatabaseInitializer - Error when creating stored procedures
com.azure.data.cosmos.CosmosClientException: Message: {"Errors":["Resource Not Found"]}
ActivityId: 4a445bec-9e42-493c-8ebe-26b6cc646c62, Request URI: /apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174287216593763s, RequestStats: 
RequestStartTime: 2019-11-05T12:16:52.4270506Z, RequestEndTime: 2019-11-05T12:16:52.4270506Z,  Number of regions attempted:1
ResponseTime: 2019-11-05T12:16:52.4270506Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.21:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174287216593763s, LSN: 16068, GlobalCommittedLsn: 16068, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16068, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
ResponseTime: 2019-11-05T12:16:52.4270506Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.32:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132173994761905544s, LSN: 16068, GlobalCommittedLsn: 16068, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16068, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound
    at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:350)
    at com.azure.data.cosmos.internal.RxGatewayStoreModel.validateOrThrow(RxGatewayStoreModel.java:388)
    at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$null$5(RxGatewayStoreModel.java:326)

It mostly means that the resource hierarchy is not present. I.e. either database/container are not present. One possibility is concurrent CRUD on database/container. Is that a possibility in your workload?

Second failure also seems similar but it also exposed an assertion. @kushagraThapar can you please create a follow-up work-item to track missing OnBefore call?

kushagraThapar commented 5 years ago
12:16:52.459 [reactor-http-epoll-8] ERROR .....DatabaseInitializer - Error when creating stored procedures
com.azure.data.cosmos.CosmosClientException: Message: {"Errors":["Resource Not Found"]}
ActivityId: 4a445bec-9e42-493c-8ebe-26b6cc646c62, Request URI: /apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174287216593763s, RequestStats: 
RequestStartTime: 2019-11-05T12:16:52.4270506Z, RequestEndTime: 2019-11-05T12:16:52.4270506Z,  Number of regions attempted:1
ResponseTime: 2019-11-05T12:16:52.4270506Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.21:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132174287216593763s, LSN: 16068, GlobalCommittedLsn: 16068, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16068, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
ResponseTime: 2019-11-05T12:16:52.4270506Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.32:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132173994761905544s, LSN: 16068, GlobalCommittedLsn: 16068, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#16068, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound
  at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:350)
  at com.azure.data.cosmos.internal.RxGatewayStoreModel.validateOrThrow(RxGatewayStoreModel.java:388)
  at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$null$5(RxGatewayStoreModel.java:326)

It mostly means that the resource hierarchy is not present. I.e. either database/container are not present. One possibility is concurrent CRUD on database/container. Is that a possibility in your workload?

Second failure also seems similar but it also exposed an assertion. @kushagraThapar can you please create a follow-up work-item to track missing OnBefore call?

Sure, will do that.

kirankumarkolli commented 5 years ago

It mostly means that the resource hierarchy is not present. I.e. either database/container are not present. One possibility is concurrent CRUD on database/container. Is that a possibility in your workload?

@dkroehan can you please validate above?

dkroehan commented 5 years ago

We checked our DatabaseInitializer class and yes there is some concurrent CRUD when creating containers after the database was setup successfully. Here is the method we use: (I just renamed some variables.)

    private void ensureDatabase() {
        String databaseId = getDatabaseId();
        String container1Id = getContainer1Id();
        String container2Id = getContainer2Id();

        CosmosDatabaseResponse cosmosDatabaseResponse = client.createDatabaseIfNotExists(databaseId)
                .doOnSuccess(ignored -> LOGGER.debug("Database: '{}' available", databaseId))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing database", throwable))
                .block();
        CosmosDatabase database = cosmosDatabaseResponse.database();

        database.createContainerIfNotExists(createContainer1Properties(container1Id))
                .doOnSuccess(cosmosContainerResponse -> LOGGER.debug("Container: '{}' available", container1Id))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing container", throwable))
                .flatMapMany(response -> createStoredProcedures(response, storedProcedures))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing stored procedures", throwable))
                .publishOn(Schedulers.elastic())
                .blockLast();

        database.createContainerIfNotExists(createContainer2Properties(container2Id))
                .doOnSuccess(cosmosContainerResponse -> LOGGER.debug("Container: '{}' available", container2Id))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing container", throwable))
                .publishOn(Schedulers.elastic())
                .block();
    }

We removed the publishOn(Schedulers.elastic()) and right now it seems to solve that one.

Good to know that one should not create containers in parallel.

dkroehan commented 5 years ago

So there is still the "Resource Not Found" error regarding the ClientRetryPolicy, that I posted in the first post.

And there is one more I got from our logs. It is from an integration test that tests a bulk-delete stored-procedure.

reactor.core.Exceptions$ReactiveException: 
CosmosClientException{error={"code":"NotFound","message":"Message: {\"Errors\":[\"Resource Not Found\"]}\r\nActivityId: ba81b6df-e52c-4360-9db5-016b8bd4bb35, Request URI: /apps/1f769f8c-2da8-4eb4-987b-3bbdbd761a1f/services/149cb50c-514c-4c09-b82f-e8ed63c3dfbd/partitions/0931392a-b005-4c66-b75d-2c1f8987ef65/replicas/132182001009793585p/, RequestStats: \r\nRequestStartTime: 2019-11-14T11:34:06.1578194Z, RequestEndTime: 2019-11-14T11:34:06.1578194Z,  Number of regions attempted:1\r\nResponseTime: 2019-11-14T11:34:06.1578194Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-westeurope1-fd18.documents.azure.com:14025/apps/1f769f8c-2da8-4eb4-987b-3bbdbd761a1f/services/149cb50c-514c-4c09-b82f-e8ed63c3dfbd/partitions/0931392a-b005-4c66-b75d-2c1f8987ef65/replicas/132182001009793585p/, LSN: 64, GlobalCommittedLsn: 64, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 0.62, ItemLSN: -1, SessionToken: -1#64, UsingLocalLSN: False, TransportException: null, ResourceType: StoredProcedure, OperationType: ExecuteJavaScript\r\n, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound","additionalErrorInfo":null}, resourceAddress='null', statusCode=404, message=Message: {"Errors":["Resource Not Found"]}
ActivityId: ba81b6df-e52c-4360-9db5-016b8bd4bb35, Request URI: /apps/1f769f8c-2da8-4eb4-987b-3bbdbd761a1f/services/149cb50c-514c-4c09-b82f-e8ed63c3dfbd/partitions/0931392a-b005-4c66-b75d-2c1f8987ef65/replicas/132182001009793585p/, RequestStats: 
RequestStartTime: 2019-11-14T11:34:06.1578194Z, RequestEndTime: 2019-11-14T11:34:06.1578194Z,  Number of regions attempted:1
ResponseTime: 2019-11-14T11:34:06.1578194Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-westeurope1-fd18.documents.azure.com:14025/apps/1f769f8c-2da8-4eb4-987b-3bbdbd761a1f/services/149cb50c-514c-4c09-b82f-e8ed63c3dfbd/partitions/0931392a-b005-4c66-b75d-2c1f8987ef65/replicas/132182001009793585p/, LSN: 64, GlobalCommittedLsn: 64, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 0.62, ItemLSN: -1, SessionToken: -1#64, UsingLocalLSN: False, TransportException: null, ResourceType: StoredProcedure, OperationType: ExecuteJavaScript
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound, causeInfo=null, responseHeaders={Transfer-Encoding=chunked, x-ms-last-state-change-utc=Thu, 14 Nov 2019 10:15:05.801 GMT, x-ms-current-replica-set-size=4, Server=Microsoft-HTTPAPI/2.0, x-ms-session-token=0:-1#64, lsn=64, x-ms-request-charge=0.62, x-ms-schemaversion=1.8, x-ms-transport-request-id=5, x-ms-number-of-read-regions=0, x-ms-current-write-quorum=3, x-ms-cosmos-quorum-acked-llsn=64, x-ms-quorum-acked-lsn=64, Date=Thu, 14 Nov 2019 11:34:05 GMT, x-ms-activity-id=ba81b6df-e52c-4360-9db5-016b8bd4bb35, Strict-Transport-Security=max-age=31536000, x-ms-xp-role=1, x-ms-global-Committed-lsn=64, x-ms-cosmos-llsn=64, x-ms-gatewayversion=version=2.7.0, x-ms-serviceversion=version=2.7.0.0, Content-Type=application/json}, requestHeaders={authorization=type%3Dmaster%26ver%3D1.0%26sig%3DvTBHvnnHwNHN6QSQ1%2BSocdOcg4A2fxnm%2BMRGPaV90Qw%3D, Accept=application/json, x-ms-date=Thu, 14 Nov 2019 11:34:06 GMT, x-ms-documentdb-partitionkey=["286a1129__286a1129"], x-ms-consistency-level=Session, Content-Type=application/json}}
    at ------ removed --------
Caused by: com.azure.data.cosmos.CosmosClientException: 
Message: {"Errors":["Resource Not Found"]}
ActivityId: ba81b6df-e52c-4360-9db5-016b8bd4bb35, Request URI: /apps/1f769f8c-2da8-4eb4-987b-3bbdbd761a1f/services/149cb50c-514c-4c09-b82f-e8ed63c3dfbd/partitions/0931392a-b005-4c66-b75d-2c1f8987ef65/replicas/132182001009793585p/, RequestStats: 
RequestStartTime: 2019-11-14T11:34:06.1578194Z, RequestEndTime: 2019-11-14T11:34:06.1578194Z,  Number of regions attempted:1
ResponseTime: 2019-11-14T11:34:06.1578194Z, StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-westeurope1-fd18.documents.azure.com:14025/apps/1f769f8c-2da8-4eb4-987b-3bbdbd761a1f/services/149cb50c-514c-4c09-b82f-e8ed63c3dfbd/partitions/0931392a-b005-4c66-b75d-2c1f8987ef65/replicas/132182001009793585p/, LSN: 64, GlobalCommittedLsn: 64, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 0.62, ItemLSN: -1, SessionToken: -1#64, UsingLocalLSN: False, TransportException: null, ResourceType: StoredProcedure, OperationType: ExecuteJavaScript
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound
dkroehan commented 5 years ago

We checked our DatabaseInitializer class and yes there is some concurrent CRUD when creating containers after the database was setup successfully. Here is the method we use: (I just renamed some variables.)

    private void ensureDatabase() {
        String databaseId = getDatabaseId();
        String container1Id = getContainer1Id();
        String container2Id = getContainer2Id();

        CosmosDatabaseResponse cosmosDatabaseResponse = client.createDatabaseIfNotExists(databaseId)
                .doOnSuccess(ignored -> LOGGER.debug("Database: '{}' available", databaseId))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing database", throwable))
                .block();
        CosmosDatabase database = cosmosDatabaseResponse.database();

        database.createContainerIfNotExists(createContainer1Properties(container1Id))
                .doOnSuccess(cosmosContainerResponse -> LOGGER.debug("Container: '{}' available", container1Id))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing container", throwable))
                .flatMapMany(response -> createStoredProcedures(response, storedProcedures))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing stored procedures", throwable))
                .publishOn(Schedulers.elastic())
                .blockLast();

        database.createContainerIfNotExists(createContainer2Properties(container2Id))
                .doOnSuccess(cosmosContainerResponse -> LOGGER.debug("Container: '{}' available", container2Id))
                .doOnError(throwable -> LOGGER.error("Error when creating/using existing container", throwable))
                .publishOn(Schedulers.elastic())
                .block();
    }

We removed the publishOn(Schedulers.elastic()) and right now it seems to solve that one.

Good to know that one should not create containers in parallel.

Unfortunately we still see "Resource Not Found" errors when setting up CosmosDB. Errors are less, but they still occur.

Here is the stacktrace:

20:58:22.873 [reactor-http-epoll-1] ERROR DatabaseInitializer - Error when creating/using existing stored procedures
com.azure.data.cosmos.CosmosClientException: Message: {"Errors":["Resource Not Found"]}
ActivityId: 44fd67d3-0f81-4bfa-8810-9ca22f494175, Request URI: /apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132187246529389581s, RequestStats: 
RequestStartTime: 2019-11-20T20:58:22.8454700Z, RequestEndTime: 2019-11-20T20:58:22.8555060Z,  Number of regions attempted:1
ResponseTime: 2019-11-20T20:58:22.8555060Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.36:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132187246529389581s, LSN: 19421, GlobalCommittedLsn: 19421, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#19421, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
ResponseTime: 2019-11-20T20:58:22.8555060Z, StoreResult: StorePhysicalAddress: rntbd://10.0.0.30:11000/apps/3d31f75e-20f5-4083-946c-a7410e147abb/services/edff4562-9097-4a75-8a25-5c77b6e88a57/partitions/42e35da6-ebce-41b0-8e41-03dfbb3ad2da/replicas/132187489217457376s, LSN: 19421, GlobalCommittedLsn: 19421, PartitionKeyRangeId: , IsValid: True, StatusCode: 404, SubStatusCode: 0, RequestCharge: 1, ItemLSN: -1, SessionToken: -1#19421, UsingLocalLSN: False, TransportException: null, ResourceType: Collection, OperationType: Read
, SDK: Microsoft.Azure.Documents.Common/2.7.0, StatusCode: NotFound
    at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:350)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Assembly trace from producer [reactor.core.publisher.FluxFlatMap] :
    reactor.core.publisher.Flux.flatMap
    com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$1(RxGatewayStoreModel.java:266)
Error has been observed at the following site(s):
    |_       Flux.flatMap ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$1(RxGatewayStoreModel.java:266)
    |_        Flux.single ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$1(RxGatewayStoreModel.java:281)
    |_       Mono.flatMap ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:246)
    |_           Mono.map ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:283)
    |_         Mono.error ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$2(RxGatewayStoreModel.java:301)
    |_ Mono.onErrorResume ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:284)
    |_          Mono.flux ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:302)
    |_                    ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.doCreate(RxGatewayStoreModel.java:86)
    |_                    ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.invokeAsyncInternal(RxGatewayStoreModel.java:342)
    |_        Flux.single ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$invokeAsync$3(RxGatewayStoreModel.java:365)
    |_                    ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.lambda$executeRetry$0(BackoffRetryUtility.java:34)
    |_         Mono.defer ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.executeRetry(BackoffRetryUtility.java:31)
    |_       Flux.flatMap ⇢ at com.azure.data.cosmos.internal.RetryUtils.lambda$toRetryWhenFunc$2(RetryUtils.java:22)
    |_     Mono.retryWhen ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.executeRetry(BackoffRetryUtility.java:38)
    |_          Mono.flux ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.invokeAsync(RxGatewayStoreModel.java:366)
    |_         Flux.error ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$processMessage$4(RxGatewayStoreModel.java:394)
    |_ Flux.onErrorResume ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.processMessage(RxGatewayStoreModel.java:375)
    |_           Flux.map ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.processMessage(RxGatewayStoreModel.java:396)
    |_                    ⇢ at com.azure.data.cosmos.internal.RxDocumentClientImpl.create(RxDocumentClientImpl.java:1043)
    |_           Flux.map ⇢ at com.azure.data.cosmos.internal.RxDocumentClientImpl.createStoredProcedureInternal(RxDocumentClientImpl.java:1461)
    |_                    ⇢ at com.azure.data.cosmos.internal.RxDocumentClientImpl.lambda$createStoredProcedure$38(RxDocumentClientImpl.java:1442)
    |_        Flux.single ⇢ at com.azure.data.cosmos.internal.ObservableHelper.lambda$inlineIfPossibleAsObs$1(ObservableHelper.java:44)
    |_                    ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.lambda$executeRetry$0(BackoffRetryUtility.java:34)
    |_         Mono.defer ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.executeRetry(BackoffRetryUtility.java:31)
    |_       Flux.flatMap ⇢ at com.azure.data.cosmos.internal.RetryUtils.lambda$toRetryWhenFunc$1(RetryUtils.java:28)
    |_       Flux.flatMap ⇢ at com.azure.data.cosmos.internal.RetryUtils.lambda$toRetryWhenFunc$2(RetryUtils.java:22)
    |_     Mono.retryWhen ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.executeRetry(BackoffRetryUtility.java:38)
    |_          Mono.flux ⇢ at com.azure.data.cosmos.internal.ObservableHelper.inlineIfPossibleAsObs(ObservableHelper.java:44)
    |_                    ⇢ at com.azure.data.cosmos.internal.RxDocumentClientImpl.createStoredProcedure(RxDocumentClientImpl.java:1442)
    |_           Flux.map ⇢ at com.azure.data.cosmos.CosmosScripts.createStoredProcedure(CosmosScripts.java:58)
    |_        Flux.single ⇢ at com.azure.data.cosmos.CosmosScripts.createStoredProcedure(CosmosScripts.java:59)
Stack trace:
        at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:350)
        at com.azure.data.cosmos.internal.RxGatewayStoreModel.validateOrThrow(RxGatewayStoreModel.java:333)
        at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$0(RxGatewayStoreModel.java:270)
        at reactor.core.publisher.FluxFlatMap$FlatMapMain.onNext(FluxFlatMap.java:378)
        at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:67)
        at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onNext(FluxDoFinally.java:123)
        at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:178)
        at reactor.core.publisher.FluxMapFuseable$MapFuseableConditionalSubscriber.onNext(FluxMapFuseable.java:287)
        at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onNext(FluxDoFinally.java:123)
        at reactor.core.publisher.FluxFilterFuseable$FilterFuseableConditionalSubscriber.onNext(FluxFilterFuseable.java:330)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1592)
        at reactor.core.publisher.MonoReduceSeed$ReduceSeedSubscriber.onComplete(MonoReduceSeed.java:156)
        at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onComplete(FluxMapFuseable.java:144)
        at reactor.netty.channel.FluxReceive.terminateReceiver(FluxReceive.java:397)
        at reactor.netty.channel.FluxReceive.drainReceiver(FluxReceive.java:197)
        at reactor.netty.channel.FluxReceive.onInboundComplete(FluxReceive.java:345)
        at reactor.netty.channel.ChannelOperations.onInboundComplete(ChannelOperations.java:363)
        at reactor.netty.channel.ChannelOperations.terminate(ChannelOperations.java:412)
        at reactor.netty.http.client.HttpClientOperations.onInboundNext(HttpClientOperations.java:556)
        at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:91)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302)
        at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224)
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:483)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:383)
        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)
kirankumarkolli commented 5 years ago

@dkroehan just to be on same-page ensureDatabase() is called before each API call, hence its concurrent CRUD?

Collection/Database operations are expensive, is it possible to limit them to only initialization path?

dkroehan commented 4 years ago

@kirankumarkolli We continued investigating the issue on our side and it seems that we found the issue.

It was a combination of multiple issues in the way the integration tests were executed. I also found out that it is not a good idea to drop a collection and create a collection with the same name immediately afterwards.

I will close that issue. Thx for your helpful input!