Dewberry / ripple1d

Utilities for reuse of HEC-RAS models for NWM
https://ripple1d.readthedocs.io/en/latest/
MIT License

Huey DB encounters race condition when Flask server is under load #99

Open ar-siddiqui opened 1 month ago

ar-siddiqui commented 1 month ago

Getting errors when launching many jobs in one go.

  File "D:\Users\abdul.siddiqui\venvs\ripple-py312\Lib\site-packages\huey\storage.py", line 650, in db
    if commit: cursor.execute(self.begin_sql)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.OperationalError: database is locked

SQLite allows only one write connection at a time, which might be the cause of this error. There should be a single connection object, and it should be passed around to the different functions.
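
For reference, the single-writer limitation is easy to reproduce outside of huey. Here is a minimal standalone sketch (not ripple1d code; the database file name is made up) that triggers the same "database is locked" error by holding a write transaction on one connection while a second connection tries to write with no busy timeout:

```python
import sqlite3

DB = "lock_demo.db"  # throwaway demo database, not a ripple1d file

# Connection 1 takes the write lock and holds it.
writer = sqlite3.connect(DB, timeout=0, isolation_level=None)  # autocommit; manual BEGIN
writer.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
writer.execute("BEGIN IMMEDIATE")            # acquires the write lock
writer.execute("INSERT INTO t VALUES (1)")   # lock stays held until COMMIT/ROLLBACK

# Connection 2 tries to write while the lock is held. With timeout=0 it
# gives up immediately instead of waiting out a busy timeout.
other = sqlite3.connect(DB, timeout=0, isolation_level=None)
try:
    other.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError as exc:
    print(exc)  # -> database is locked

writer.execute("ROLLBACK")
```

Increasing the timeout argument makes the second connection wait and retry instead of failing immediately, which is why a larger busy timeout in the huey storage layer is a plausible fix.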

slawler commented 1 month ago

@mxkpp do we have a setting / parameter for HEC-RAS core utilization? I wonder if it is greedy and whether that could be causing some of the issues here. Since these are all 1D jobs, could we set the utilization to .05 or something like that?

slawler commented 1 month ago

In any event, let's do some stress testing on huey and see if we can reproduce this issue.

mxkpp commented 1 month ago

In theory, the official RasController COM API for 1D should use however many cores are specified in the plan file being executed, as directed by the variable UNET D1 Cores. When this is 0, i.e. UNET D1 Cores= 0, it uses all available cores.

However, when experimenting with 2D automation I have seen messages in the RAS GUI indicating that it ignores that directive (the 2D equivalent, rather) and instead uses all available cores when executing via the official RasController COM API. I personally have not tried adjusting UNET D1 Cores for a 1D steady-state run executed programmatically (via the COM API).
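
If we do add a knob, one straightforward (if low-tech) option is to rewrite that keyword in the plan file before kicking off the run. A rough sketch, assuming the plan file is plain text with a single UNET D1 Cores= line; the helper name and path below are made up for illustration:

```python
import re
from pathlib import Path


def set_d1_cores(plan_file: str, cores: int) -> None:
    """Hypothetical helper: rewrite 'UNET D1 Cores=' in a HEC-RAS plan file."""
    path = Path(plan_file)
    text = path.read_text()
    # 0 means "use all available cores"; any other value caps the core count.
    new_text, count = re.subn(
        r"(?m)^UNET D1 Cores=\s*\d+", f"UNET D1 Cores= {cores}", text
    )
    if count == 0:
        raise ValueError(f"'UNET D1 Cores=' not found in {plan_file}")
    path.write_text(new_text)


# Example: cap a 1D run at 2 cores before launching it through the COM API.
# set_d1_cores(r"C:\models\example\project.p01", 2)  # path is illustrative only
```

Whether the COM API actually honors the value at run time is exactly the open question above, so this would need to be verified against a real 1D steady run.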

mxkpp commented 1 month ago

I don't think this is necessarily related to RAS cores, though it would be good to have a knob to turn for the RAS cores.

@ar-siddiqui do you have quantities you could provide for context of when/where this is happening? Machine specs, disk type (for file locks, SSD would be much better than slow HDD), num concurrent huey jobs, frequency at which you are pinging huey for status, etc?

After looking into it some and reading this discussion: https://github.com/coleifer/huey/issues/445#issuecomment-527951933

I think we should try specifying a healthy timeout here https://github.com/Dewberry/ripple/blob/cc4ca8460f2c10c6a29da4780204b9edf7b8cefb/api/tasks.py#L17
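
For what it's worth, a minimal sketch of what that might look like, assuming SqliteHuey passes the timeout keyword through to the underlying SQLite storage (the name, filename, and 15-second value below are placeholders, not what api/tasks.py actually uses):

```python
from huey import SqliteHuey

# Sketch only: a larger busy timeout makes concurrent writers wait for the
# lock instead of raising "database is locked" right away.
huey = SqliteHuey(
    "ripple",            # placeholder queue name
    filename="huey.db",  # placeholder path to the huey SQLite database
    timeout=15,          # assumed to be forwarded to the SQLite connection (seconds)
)
```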

ar-siddiqui commented 1 month ago

> I don't think this is necessarily related to RAS cores, though it would be good to have a knob to turn for the RAS cores.

Agreed to both points. I did see this error for tasks unrelated to HEC-RAS.

> @ar-siddiqui do you have quantities you could provide for context of when/where this is happening? Machine specs, disk type (for file locks, SSD would be much better than slow HDD), num concurrent huey jobs, frequency at which you are pinging huey for status, etc?

I encountered this error with as few as 50 simultaneous job requests. I added a wait of 1 to 4 seconds between requests to deal with it on my side.
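
For context, the client-side workaround looks roughly like this (the endpoint and payloads are placeholders, not the actual ripple1d routes):

```python
import random
import time

import requests

URL = "http://localhost:5000/processes/example/execution"  # placeholder endpoint
payloads = [{"job": i} for i in range(50)]                 # placeholder payloads

for payload in payloads:
    requests.post(URL, json=payload, timeout=30)
    # Stagger submissions so the huey SQLite writer is not hit all at once.
    time.sleep(random.uniform(1, 4))
```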

> After looking into it some and reading this discussion: coleifer/huey#445 (comment)
>
> I think we should try specifying a healthy timeout here
> https://github.com/Dewberry/ripple/blob/cc4ca8460f2c10c6a29da4780204b9edf7b8cefb/api/tasks.py#L17

mxkpp commented 1 month ago

At some point we could put a rate limiter on Flask itself, which would be safer since a SQLite lock conflict could impact any/all aspects of the huey layer (disrupting existing jobs, etc).
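
One way to do that, as a sketch only (Flask-Limiter is just one option and is not something ripple1d currently uses; the route and limits below are illustrative):

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

# Cap how fast any single client can hit the API; the numbers are illustrative.
limiter = Limiter(
    key_func=get_remote_address,       # rate-limit per client IP
    app=app,
    default_limits=["30 per minute"],
)


@app.route("/processes/example/execution", methods=["POST"])  # placeholder route
@limiter.limit("10 per minute")  # tighter limit on job-submission endpoints
def execute():
    return {"status": "accepted"}, 201
```

That would push back on bursty clients before they ever reach the huey/SQLite layer.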

ar-siddiqui commented 2 weeks ago

This is becoming a pain point. The API can't handle as few as 10 to 20 requests at a time.

mxkpp commented 1 week ago

@ar-siddiqui to clarify, are you referring to huey processing 10 to 20 concurrent jobs, or to Flask responding to 10 to 20 concurrent requests (such as job status GET requests)? If the latter, what is the request rate?

ar-siddiqui commented 1 week ago

The error is being encountered on concurrent POST requests to the /processes/ routes. It has never been encountered on the GET /jobs/ route.

I have seen the error pop up on as few as 10 to 20 concurrent POST requests.