eclipse-edc / Connector

EDC core services including data plane and control plane
Apache License 2.0
282 stars 236 forks source link

Bug when querying >5k results from PostgreSQL #3044

Closed richardtreier closed 1 year ago

richardtreier commented 1 year ago

Bug Report

Describe the Bug

When querying transfer processes, and there's about 6k rows in edc_transfer_process, using the management API, ?limit=4999 works, but ?limit=5000 fails.

With PostgreSQL errors such as ERROR: portal "C_3448" does not exist.

Expected Behavior

Observed Behavior

Database Errors

Steps to Reproduce

  1. Have an EDC with PostgreSQL persistence.
  2. Have about about 6k rows in edc_transfer_process, which has been reached in some productive connectors.
  3. /transferprocess?limit=4999 should work
  4. /transferprocess?limit=5000 should not work

Context Information

MS8 PostgreSQL 11

Detailed Description

An excerpt from our grafana logs:

[2023-05-19 09:12:54] [SEVERE ] JerseyExtension: Unexpected exception caught
org.eclipse.edc.spi.persistence.EdcPersistenceException: org.postgresql.util.PSQLException: ERROR: portal "C_1994" does not exist
    at org.eclipse.edc.sql.SqlQueryExecutor$1.tryAdvance(SqlQueryExecutor.java:123)
    at java.base/java.util.Spliterator.forEachRemaining(Spliterator.java:332)
2x      
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
2x      
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
2x      
    at org.eclipse.edc.connector.api.management.transferprocess.TransferProcessApiController.queryTransferProcesses(TransferProcessApiController.java:150)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
3x      
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
3x      
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
3x      
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255)
3x      
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
9x      
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234)
3x      
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
5x      
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
    at org.eclipse.jetty.server.Server.handle(Server.java:563)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
3x      
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
    at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
3x      
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421)
    at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.lambda$new$0(AdaptiveExecutionStrategy.java:139)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:933)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1077)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.postgresql.util.PSQLException: ERROR: portal "C_1994" does not exist
3x      
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
    at org.postgresql.jdbc.PgResultSet.next(PgResultSet.java:2144)
    at org.eclipse.edc.sql.SqlQueryExecutor$1.tryAdvance(SqlQueryExecutor.java:117)
    ... 61 more
SebastianOpriel commented 1 year ago

Now a second connector ran into the same problem. I expect that more users will face the issue in the future.

ndr-brt commented 1 year ago

this is likely caused by the Statement fetchSize set at 5000: https://github.com/eclipse-edc/Connector/blob/905ef20363fcb65b989299ef85e80f4951212ca3/extensions/common/sql/sql-core/src/main/java/org/eclipse/edc/sql/SqlQueryExecutor.java#L94

My proposal could be to make it configurable, maybe raise the default value (but not too much as this should be something that should be tuned properly) and improve the error message.

Would you like to contribute to it?

ndr-brt commented 1 year ago

I'll work on this

ndr-brt commented 1 year ago

I'm not able to reproduce it, trying from an integration test added in the PostgresTransferProcessStoreTest:

    @Test
    void query_6000_items() {
        range(0, 6000).forEach(i -> getTransferProcessStore().updateOrCreate(createTransferProcess("test-neg-" + i)));
        var query = QuerySpec.Builder.newInstance().limit(6000).build();

        var result = getTransferProcessStore().findAll(query);

        assertThat(result).hasSize(6000);
    }

This works. I started postgres with this command:

docker run --rm --name edc-postgres -e POSTGRES_PASSWORD=password -p 5432:5432 -d postgres

I scrolled through the JDBC/PostgreSQL documentation about the fetchSize that was my first guess but in fact that's a way to eventually "tune" the query, not to put a limit on it.

So my suspects go to your postgres instance, does it have specific configuration? What script you used to create the schema? Unfortunately the internet does not speak a lot about this specific issue, but my suggestion is to try to run a connector with a fresh postgres instance and see if the issue appears again.

richardtreier commented 1 year ago

Nice to see some progress and tests!

Yes, the FetchSize query hint is supposed to be largely ignored by PostgreSQL and should thus not be set or set to -1 (https://www.phind.com/search?cache=6e83988d-b28e-4974-8a93-201367336839&init=true).

Another lead I had was that most of my search results pointed to problems with either Connection-reuse and setAutoCommit, as the integration test code mocks the creation of the connection in PostgresqlStoreSetupExtension#beforeEach.

So there's three possibilities as I see:

ndr-brt commented 1 year ago

I would prefer looking at the official documentation than some AI guessing:

Changing the code to use cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).

We need to use the cursor because this was the point of using streams, otherwise the catalog creation could affect the connector performances significantly (if not crash it). Same speech for setAutoCommit(false).

Having the fetchSize value configurable seems fair to me, setting a reasonable default value (as 5000) to keep the current (and tested) behavior working as expected.