apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

Sqllab stuck executing a query #27850

Closed: aavanzyl closed this issue 4 months ago

aavanzyl commented 6 months ago

Bug description

When using SQL Lab to execute a query such as SELECT * from claims.claim limit 1, the UI shows Pending and the execution time never starts increasing.

The following is logged in the browser:

Execute: { "query": { "changed_on": "2024-04-02T07:16:18.493787", "dbId": 1, "db": "Amazon Redshift (Prod)", "endDttm": null, "errorMessage": null, "executedSql": null, "id": "9E6eWGcfB", "queryId": 26748, "limit": 1, "limitingFactor": "QUERY", "progress": 0, "rows": null, "schema": null, "ctas": false, "serverId": 26748, "sql": "SELECT * from claims.claim limit 1\n", "sqlEditorId": "1773", "startDttm": 1712042178240.671000, "state": "pending", "tab": "Untitled Query 1", "tempSchema": null, "tempTable": null, "userId": 1, "user": "Anton van Zyl", "resultsKey": null, "trackingUrl": null, "extra": {} } }

We get multiple updated_since responses back, such as: { "result": [ { "changed_on": "2024-04-02T07:16:18.242144", "ctas": false, "db": "Amazon Redshift (Prod)", "dbId": 1, "endDttm": null, "errorMessage": null, "executedSql": null, "extra": {}, "id": "9E6eWGcfB", "limit": 1000, "limitingFactor": "UNKNOWN", "progress": 0, "queryId": 26748, "resultsKey": null, "rows": null, "schema": null, "serverId": 26748, "sql": "SELECT * from claims.claim limit 1\n", "sqlEditorId": "1773", "startDttm": "1712042178240.671000", "state": "pending", "tab": "Untitled Query 1", "tempSchema": null, "tempTable": null, "trackingUrl": null, "user": "Anton van Zyl", "userId": 1 } ] }

But we never receive a /results callback. My current remedy is to restart the instance and the cache, which seems to intermittently resolve the issue. This issue only started recently, after we updated to 3.1.0 and then 3.1.1.

I have checked the logs; nothing stands out as out of the ordinary.

How to reproduce the bug

Go to SQL Lab and run any query; the query gets stuck in Pending.

Screenshots/recordings

The current UI state being shown to the end user (screenshot taken 2024-04-02 at 07:46:55, showing the query stuck in Pending).

Superset version

3.1.1

Python version

I don't know

Node version

I don't know

Browser

Chrome

Additional context

We are running the latest 3.1.1 Superset from a Docker container hosted on ECS in AWS. The instance has 12 GB of memory and connects to a Redshift database. Superset caching is set up with Redis 7.1.0 running on ElastiCache in AWS.
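For reference, a query that never leaves Pending in async SQL Lab usually involves the Celery broker, the worker, or the results backend. Below is a minimal sketch of the relevant superset_config.py pieces for a Redis/ElastiCache setup like the one described; the hostname, ports, and database numbers are placeholders, not the reporter's actual configuration.

```python
# superset_config.py -- sketch only; host/port/db values are placeholders
from cachelib.redis import RedisCache

REDIS_HOST = "my-elasticache-endpoint.cache.amazonaws.com"  # placeholder
REDIS_PORT = 6379

# Where the Celery worker writes finished SQL Lab result sets;
# the /results call reads from here.
RESULTS_BACKEND = RedisCache(
    host=REDIS_HOST,
    port=REDIS_PORT,
    key_prefix="superset_results",
)

class CeleryConfig:
    # Queue that pending SQL Lab queries are pushed onto and picked up from.
    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/0"
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/1"
    imports = ("superset.sql_lab",)

CELERY_CONFIG = CeleryConfig
```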


michael-s-molina commented 6 months ago

Hi @aavanzyl. Could you check if this problem happens in 3.1.2 which was released today? @justinpark for awareness.

aavanzyl commented 6 months ago

@michael-s-molina I have deployed 3.1.2 as asked. The issue is currently not showing; I will monitor for a couple of days and let you know. Thank you for the response on my issue.

For further information, I can see that clicking Run calls the execute API. This in turn takes a while to respond, and the response was empty when the issue occurred. When things are working as expected, the execute API responds with a JSON object.

I also noticed that the updated_since API call starts before the execute call has had a chance to receive a response, leading me to believe that this might contribute to the issue.
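To make the sequence above concrete, here is a rough sketch of the run/poll/fetch loop as seen from the client, using Python's requests library. The /api/v1/sqllab/execute/ path and field names mirror the logged payloads; the polling and results URLs are assumptions for illustration and not necessarily the exact endpoints the SQL Lab frontend calls.

```python
# Rough sketch of the SQL Lab client flow; URLs marked "assumed" are illustrative.
import time
import requests

BASE = "https://superset.example.com"  # placeholder host
session = requests.Session()           # assumed to already carry auth/CSRF headers

# 1. Execute: submit the query; the response echoes a query object in state "pending".
query = session.post(
    f"{BASE}/api/v1/sqllab/execute/",
    json={"database_id": 1, "sql": "SELECT * from claims.claim limit 1", "runAsync": True},
    timeout=30,
).json()["query"]

# 2. Poll: the editor repeatedly asks which queries changed since the last update.
#    In this bug report, every poll came back with the query still "pending".
while query["state"] in ("pending", "running"):
    time.sleep(1)
    updated = session.get(
        f"{BASE}/superset/queries/{int(float(query['startDttm']))}",  # assumed polling URL
        timeout=30,
    ).json()["result"]
    query = next((q for q in updated if q["id"] == query["id"]), query)

# 3. Results: only once the state flips to "success" does the editor fetch /results
#    with the resultsKey; that is the callback that never arrived here.
results = session.get(
    f"{BASE}/superset/sqllab/results/?key={query['resultsKey']}",  # assumed results URL
    timeout=30,
).json()
```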

michael-s-molina commented 6 months ago

Thanks @aavanzyl. Let us know if things are working after your tests so we can close the issue.

aavanzyl commented 6 months ago

We have had 48 hours of stability with the issue not popping up. It's safe to say it's solved. Thank you @michael-s-molina and team for the swift assistance. I appreciate the effort on your side.

aavanzyl commented 6 months ago

Hi guys, unfortunately after 6 days of stability the issue has returned. This might be related to load in some way. The behaviour explained above seems better, however; I no longer see an empty execute response being returned.

What can I provide from my side to assist in solving the issue?

michael-s-molina commented 6 months ago

@justinpark could you help?

rusackas commented 4 months ago

@aavanzyl glad to see the issue closed, but curious what happened!

aavanzyl commented 4 months ago

@rusackas so, just as a basic rundown:

Then the major change I made was to split the workers out into their own container tasks on ECS; they were previously grouped together under one ECS service. I am now able to scale the workers based on load.

With all of the above together, we have been stable the last couple of months.
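A minimal sketch of what that separation can look like, assuming the standard Celery-based async setup: the web and worker containers share the same image and superset_config.py, and only the entrypoint differs, so each ECS service can scale independently. The settings below are illustrative, not the exact configuration used here.

```python
# superset_config.py: settings relevant when SQL Lab workers run as their own service.
# The web containers start the Superset server; the worker containers run the same
# image and config but with the entrypoint:
#   celery --app=superset.tasks.celery_app:app worker
class CeleryConfig:
    broker_url = "redis://my-elasticache-endpoint:6379/0"       # placeholder, shared by web + workers
    result_backend = "redis://my-elasticache-endpoint:6379/1"   # placeholder
    imports = ("superset.sql_lab",)
    worker_prefetch_multiplier = 1  # each worker process reserves one queued query at a time
    task_acks_late = True           # a query is re-queued if a worker dies mid-task

CELERY_CONFIG = CeleryConfig
```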

We initially followed this blog: https://www.obytes.com/blog/superset-deployment, which was honestly a bad starting point. Since correcting to the right infrastructure, it has been running a lot better.