[Open] ar-siddiqui opened this issue 1 month ago
@mxkpp do we have a setting / parameter for HEC-RAS core utilization? I wonder if it is greedy and that could be causing some issues here? Since these are all 1D jobs, could we set the utilization to .05 or something like that?
In any event, let's do some stress testing on huey and see if we can reproduce this issue.
In theory, the official RasController COM API for 1D should use however many cores are specified in the plan file being executed, as directed by the variable `UNET D1 Cores`. When this is 0, i.e. `UNET D1 Cores= 0`, it uses all available cores.
However, when experimenting with 2D automation I have seen messages in the RAS GUI indicating that it is ignoring that directive (the 2D equivalent, rather) and instead using all available cores (when executing from the official RasController COM API). I personally have not tried adjusting `UNET D1 Cores` for the case of a 1D steady-state regime running programmatically (via the COM API).
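For what it's worth, that directive can be toggled programmatically before a run. A minimal sketch in Python (the key name `UNET D1 Cores` comes from the comment above; the `set_d1_cores` helper and the whitespace-tolerant regex are assumptions, since real plan files may format the line differently):

```python
import re

def set_d1_cores(plan_text: str, cores: int) -> str:
    """Rewrite the 'UNET D1 Cores' directive in HEC-RAS plan-file text.
    0 means 'use all available cores'. The helper name and the regex
    are assumptions for this sketch, not part of any official API."""
    return re.sub(r"UNET D1 Cores\s*=\s*\d+",
                  f"UNET D1 Cores= {cores}",
                  plan_text)

# Pin a 1D plan to a single core before executing it:
plan = "Plan Title=Example\nUNET D1 Cores= 0\n"
print(set_d1_cores(plan, 1))
```

Whether the COM API actually honors the rewritten value is exactly the open question above, so this would need to be verified against the GUI's compute log.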
I don't think this is necessarily related to RAS cores, though it would be good to have a knob to turn for the RAS cores.
@ar-siddiqui do you have quantities you could provide for context of when/where this is happening? Machine specs, disk type (for file locks, SSD would be much better than slow HDD), num concurrent huey jobs, frequency at which you are pinging huey for status, etc?
After looking into it some and reading this discussion: https://github.com/coleifer/huey/issues/445#issuecomment-527951933
I think we should try specifying a healthy timeout here https://github.com/Dewberry/ripple/blob/cc4ca8460f2c10c6a29da4780204b9edf7b8cefb/api/tasks.py#L17
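For context on why a timeout helps: SQLite's busy timeout makes a second writer wait for the lock to clear instead of failing immediately with "database is locked". A standalone sketch using only the stdlib `sqlite3` module (huey's `SqliteHuey` appears to forward a similar `timeout` keyword to its SQLite storage, but that is an assumption worth verifying against the huey source):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file. isolation_level=None puts
# the Python driver in autocommit mode so explicit BEGIN/COMMIT work.
db = os.path.join(tempfile.mkdtemp(), "huey.db")
w1 = sqlite3.connect(db, timeout=30, isolation_level=None)
w2 = sqlite3.connect(db, timeout=0.1, isolation_level=None)

w1.execute("CREATE TABLE task (id INTEGER)")
w1.execute("BEGIN IMMEDIATE")          # w1 takes the write lock
w1.execute("INSERT INTO task VALUES (1)")

locked_error = None
try:
    w2.execute("BEGIN IMMEDIATE")      # lock held; waits 0.1 s, then fails
except sqlite3.OperationalError as e:
    locked_error = e                   # "database is locked"

w1.execute("COMMIT")                   # release the lock
w2.execute("BEGIN IMMEDIATE")          # now succeeds
w2.execute("ROLLBACK")
print(locked_error)
```

With a generous timeout, `w2` would simply have waited out the lock instead of erroring, which is the behavior we'd want under a burst of job submissions.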
> I don't think this is necessarily related to RAS cores, though it would be good to have a knob to turn for the RAS cores.
Agreed on both points. I did see this error for tasks unrelated to HEC-RAS.
> @ar-siddiqui do you have quantities you could provide for context of when/where this is happening? Machine specs, disk type (for file locks, SSD would be much better than slow HDD), num concurrent huey jobs, frequency at which you are pinging huey for status, etc?
I encountered this error with as few as 50 simultaneous job requests; I added a random wait of 1 to 4 seconds between requests to deal with it on my side.
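That client-side workaround could look something like this sketch (`submit` is a hypothetical stand-in for whatever function fires the POST request; the helper name is made up):

```python
import random
import time

def submit_with_jitter(submit, jobs, lo=1.0, hi=4.0):
    """Fire `submit(job)` for each job, sleeping a random lo-hi seconds
    between submissions so requests don't all hit the API at once."""
    for job in jobs:
        time.sleep(random.uniform(lo, hi))
        submit(job)
```

This only spreads the load out; it doesn't fix the underlying lock contention, which is why the server-side timeout/rate-limit ideas above are still worth pursuing.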
> After looking into it some and reading this discussion: coleifer/huey#445 (comment)
> I think we should try specifying a healthy timeout here
> https://github.com/Dewberry/ripple/blob/cc4ca8460f2c10c6a29da4780204b9edf7b8cefb/api/tasks.py#L17
At some point we could put a rate limiter on Flask itself, which would be safer since a SQLite lock conflict could impact any/all aspects of the huey layer (disrupting existing jobs, etc).
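A rate limiter in front of Flask could be an extension such as Flask-Limiter, or something as simple as a token bucket checked in a `before_request` hook. A standalone sketch of the latter (all names here are illustrative, not from the ripple codebase):

```python
import threading
import time

class TokenBucket:
    """Tiny token-bucket limiter: allows bursts up to `capacity`,
    then refills `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Allow a burst of 10 requests, refilling at 5 requests/second.
bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]
print(results.count(True))  # typically 10: the burst passes, the rest are rejected
```

Requests that return `False` would get a 429 from the Flask hook, keeping excess POSTs away from the SQLite-backed huey layer entirely.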
This is becoming a pain point. The API can't handle even 10 to 20 requests at a time.
@ar-siddiqui to clarify, are you referring to huey processing 10 to 20 concurrent processing jobs, or referring to flask responding to 10 to 20 concurrent requests (such as job status GET requests)? If the latter, what is the request rate?
The error is encountered on concurrent POSTs to the /processes/ routes. It has never been encountered on the GET /jobs/ route.
I have seen the error pop up on as few as 10 to 20 concurrent POST requests.
Getting errors when launching many jobs in one go.
SQLite allows only one write connection at a time, which might be the cause of this error: there should be a single connection object, passed around to the different functions.
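To illustrate the single-connection pattern: one shared `sqlite3` connection guarded by a lock and handed to every writer serializes the writes, so SQLite never sees two competing write transactions. A stdlib-only sketch (the schema and function names are made up for the demo):

```python
import sqlite3
import threading

# One shared write connection plus a lock, passed to every function
# that writes. check_same_thread=False lets multiple threads use the
# same connection; the lock ensures they take turns.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
write_lock = threading.Lock()

def enqueue_job(conn, lock, job_id):
    with lock:  # one writer at a time, so no "database is locked"
        conn.execute("INSERT INTO jobs (id, status) VALUES (?, ?)",
                     (job_id, "queued"))
        conn.commit()

threads = [threading.Thread(target=enqueue_job, args=(conn, write_lock, i))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0])  # 20
```

Whether this is feasible here depends on how huey manages its own storage connections internally, so it may apply more to the application's own SQLite usage than to huey's.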