apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

`GrpcBrokerClusterIntegrationTest` is flaky #8684

Closed richardstartin closed 2 years ago

richardstartin commented 2 years ago

https://github.com/apache/pinot/runs/6393195568?check_suite_focus=true

walterddr commented 2 years ago

another probably related: https://github.com/apache/pinot/runs/6380255258?check_suite_focus=true

walterddr commented 2 years ago

The failure in https://github.com/apache/pinot/runs/6393195568?check_suite_focus=true seems to be a runner failure rather than anything related to gRPC.

...
2022-05-11T18:24:05.9718074Z 18:23:25.936 WARN [TimeBoundaryManager] [ClusterChangeHandlingThread] Failed to find segment with valid end time for table: mytable_OFFLINE, no time boundary generated
2022-05-11T18:24:05.9727643Z 18:23:41.210 WARN [TopStateHandoffReportStage] [HelixController-pipeline-default-GrpcBrokerClusterIntegrationTest-(f537ec4c_DEFAULT)] Event f537ec4c_DEFAULT : Cannot confirm top state missing start time. Use the current system time as the start time.
2022-05-11T18:24:05.9730929Z 18:23:49.201 WARN [TopStateHandoffReportStage] [HelixController-pipeline-default-GrpcBrokerClusterIntegrationTest-(65a70c02_DEFAULT)] Event 65a70c02_DEFAULT : Cannot confirm top state missing start time. Use the current system time as the start time.
2022-05-11T18:24:36.5164660Z [ERROR] Killed    
    <-------- [RR] Seems to be a transient runner failure?
2022-05-11T18:24:38.1386025Z [INFO] Running org.apache.pinot.integration.tests.access.CertBasedTlsChannelAccessControlFactory$CertBasedTlsChannelAccessControl$1
2022-05-11T18:24:38.5703576Z [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.356 s - in org.apache.pinot.integration.tests.access.CertBasedTlsChannelAccessControlFactory$CertBasedTlsChannelAccessControl$1
2022-05-11T18:24:39.3086423Z [INFO] Running org.apache.pinot.integration.tests.access.CertBasedTlsChannelAccessControlFactory$CertBasedTlsChannelAccessControl
...
walterddr commented 2 years ago

Regarding the second one:

2022-05-11T01:28:31.3764680Z 01:28:20.705 ERROR [StreamingSelectionOnlyCombineOperator] [grpc-default-executor-0] Timed out while polling results block (query: QueryContext{_tableName='mytable_OFFLINE', _subquery=null, _selectExpressions=[*], _aliasList=[null], _filter=null, _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=1000000, _offset=0, _queryOptions={}, _debugOptions=null, _expressionOverrideHints={}, _explain=false})

This looks like a query timeout. We should set a higher timeout value for this `SELECT *` query, which streams back about 110K rows, when it runs on GHA servers.
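For reference, one way to raise the timeout for a single query is Pinot's `timeoutMs` query option. The sketch below is only an illustration: the helper class `QueryTimeoutSketch` and its method are hypothetical (not part of the Pinot test code), and it assumes the `OPTION(timeoutMs=...)` suffix syntax, which should be checked against the Pinot version in use (newer versions also accept a `SET timeoutMs = ...;` prefix).

```java
// Hypothetical helper, not part of the Pinot code base.
final class QueryTimeoutSketch {
  private QueryTimeoutSketch() {
  }

  /**
   * Appends a per-query timeout to a SQL string.
   * Assumes the query does not already carry an OPTION(...) clause.
   */
  static String withTimeout(String sql, long timeoutMs) {
    if (timeoutMs <= 0) {
      throw new IllegalArgumentException("timeoutMs must be positive: " + timeoutMs);
    }
    // Assumption: the OPTION(key=value) suffix is accepted by the broker under test.
    return sql.trim() + " OPTION(timeoutMs=" + timeoutMs + ")";
  }

  public static void main(String[] args) {
    System.out.println(withTimeout("SELECT * FROM mytable LIMIT 1000000", 60_000L));
  }
}
```

Whether a per-query option or a cluster-level timeout setting is the better fit for the integration test is a separate choice; this only shows the shape of the per-query variant.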

Jackie-Jiang commented 2 years ago

Another failure on the master branch: https://github.com/apache/pinot/runs/6517186511?check_suite_focus=true

The JVM crashed during the test. We should try to figure out what caused the crash.

richardstartin commented 2 years ago

I think we need an RCA before closing these; lots of these issues have been reopened.

walterddr commented 2 years ago

Sorry, I was confused about the details of this issue. I thought we had consensus, since there was no reply to my previous two comments,

so I was only fixing the GRPCServer* test in https://github.com/apache/pinot/pull/8686. Let me take a look at the Broker one as well, then. Thanks for reopening.

@richardstartin @Jackie-Jiang any idea how I can take a core/thread dump in a GitHub Action? I can use the same stress-test technique as in 593a531ccc7a74cf33c626656229c56c693b23e1 for the broker, but I am not sure how to dump the state.
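On the thread-dump part of the question, a minimal sketch of an in-process dump using only the JDK's `java.lang.management` API is shown below. The class name `ThreadDumpSketch` is hypothetical, and how it would be wired into the test (for example, called when a query wait times out) is an assumption. It only helps with hangs and timeouts: it cannot capture anything once the JVM has crashed or the process has been killed by the runner.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the Pinot code base.
final class ThreadDumpSketch {
  private ThreadDumpSketch() {
  }

  /** Returns a text dump of every live thread with its full stack trace. */
  static String dumpAllThreads() {
    ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
    // Request lock details only where the JVM supports them, to avoid
    // UnsupportedOperationException.
    ThreadInfo[] infos = threadBean.dumpAllThreads(
        threadBean.isObjectMonitorUsageSupported(),
        threadBean.isSynchronizerUsageSupported());
    StringBuilder sb = new StringBuilder();
    for (ThreadInfo info : infos) {
      sb.append('"').append(info.getThreadName()).append('"')
          .append(" state=").append(info.getThreadState()).append('\n');
      // Format frames manually; ThreadInfo.toString() truncates long stacks.
      for (StackTraceElement frame : info.getStackTrace()) {
        sb.append("    at ").append(frame).append('\n');
      }
      sb.append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Writing to stderr so the dump ends up in the CI job log.
    System.err.println(dumpAllThreads());
  }
}
```

From outside the process, the JDK tools `jstack <pid>` or `jcmd <pid> Thread.print` produce a similar dump, provided the workflow step can still reach the test JVM.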