NREL / OpenStudio-server

The OpenStudio Server is a Docker- or Helm-deployable instance that allows for large-scale parametric analyses of building energy models using the OpenStudio or URBANopt CLIs.
http://www.openstudio.net/

add retries to api_get_status.rb #669

Closed brianlball closed 1 year ago

brianlball commented 2 years ago

There can be 503 Service Unavailable errors when the web container gets slammed:

[1] "run command: ruby /opt/openstudio/R/lib/api_create_datapoint.rb -h http://web:80 -a d7fa128a-2c5b-4ce5-a601-8942f82427f6 -v -26.6666666666667,0 --submit"
[1] "Check Run Flag run command: ruby /opt/openstudio/R/lib/api_get_status.rb -h http://web:80 -a d7fa128a-2c5b-4ce5-a601-8942f82427f6"
[1] "Check Run Flag z: {:submit_simulation=>false, :sleep_time=>5, :host=>\"http://web:80\", :analysis_id=>\"d7fa128a-2c5b-4ce5-a601-8942f82427f6\"}"
[2] "Check Run Flag z: /opt/openstudio/R/lib/api_get_status.rb Error: 503 Service Unavailable:/usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/abstract_response.rb:223:in `exception_with_response'"
[3] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/abstract_response.rb:103:in `return!'"
[4] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/request.rb:809:in `process_result'"
[5] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/request.rb:725:in `block in transmit'"
[6] "Check Run Flag z: /usr/local/lib/ruby/2.7.0/net/http.rb:933:in `start'"
[7] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'"
[8] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'"
[9] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'"
[10] "Check Run Flag z: /usr/local/lib/ruby/gems/2.7.0/gems/rest-client-2.0.2/lib/restclient.rb:67:in `get'"
[11] "Check Run Flag z: /opt/openstudio/R/lib/api_get_status.rb:62:in `<main>'"
[12] "Check Run Flag z: {\"status\":false,\"result\":true}"

This should get retried.
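A minimal sketch of what the retry could look like, assuming the existing RestClient.get call in api_get_status.rb is wrapped in a loop (the max_retries count and sleep interval here are illustrative, not values from the script):

```ruby
# Hypothetical retry wrapper for the status check in api_get_status.rb.
# Retries the GET a few times on HTTP errors (e.g. 503) before giving up.
require 'rest-client'
require 'json'

def get_status_with_retries(url, max_retries: 3, sleep_time: 5)
  attempts = 0
  begin
    attempts += 1
    response = RestClient.get(url)
    JSON.parse(response.body)
  rescue RestClient::ExceptionWithResponse => e
    if attempts <= max_retries
      puts "attempt #{attempts} failed with #{e.message}, retrying in #{sleep_time}s"
      sleep(sleep_time)
      retry
    end
    # all retries exhausted: surface the failure like the current script does
    puts "Error: #{e.message}"
    { 'status' => false, 'result' => false }
  end
end
```

With something along these lines, a transient 503 during the run_flag check would not immediately leave the datapoint stuck in an NA state.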

brianlball commented 2 years ago

Tests work:

[1] "Check Run Flag run command: ruby /opt/openstudio/R/lib/api_get_status.rb -h http://web:80 -a b2f6b810-15e0-4b65-96e2-069a8e555210" [1] "Check Run Flag z: {:submit_simulation=>false, :sleep_time=>5, :host=>\"http://web:80\", :analysis_id=>\"b2f6b810-15e0-4b65-96e2-069a8e555210\"}" [2] "Check Run Flag z: /opt/openstudio/R/lib/api_get_status.rb success! get_count: 1" [3] "Check Run Flag z: {\"status\":true,\"result\":true}" [1] "run_flag_json: TRUE" "run_flag_json: TRUE"

brianlball commented 2 years ago

@nllong I think https://github.com/NREL/OpenStudio-server/issues/337 is the underlying issue, and it explains why we are getting datapoints with NA status that don't get run.


Setting max_request_queue_size (MAX_REQUESTS) to a value of 0 means that the queue is unbounded.

The R cluster run calls the following files:
ruby /opt/openstudio/R/lib/api_create_datapoint.rb, which calls ruby /opt/openstudio/R/lib/api_get_status.rb (to check whether the analysis still has run_flag == true).

The logs above show that this can get a 503, so I think we should:

  1. Add a retry in the status check (this was a #TODO in the file). If the retry loop fails, the datapoint remains in an NA state and does not get run.
  2. max_request_queue_size (MAX_REQUESTS) should be at least as big as the R cluster (which is defined in the OSA or here), or it could be unbounded. Setting this locally to 10 greater than the size of the R cluster, I have not seen any NAs and the number of datapoints that are run is correct (see the sketch below).
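For illustration only, the sizing rule in point 2 could look something like this (the r_cluster_size value is made up; in practice it would come from the OSA):

```ruby
# Hypothetical illustration of the proposed sizing rule: make the Passenger
# request queue at least as large as the R cluster, plus a little headroom,
# instead of letting overflow requests return 503.
r_cluster_size = 48   # e.g. number of R workers defined in the OSA (hypothetical)
headroom       = 10   # slack that avoided NA datapoints in local testing

max_request_queue_size = r_cluster_size + headroom

# Directive that would end up in nginx.conf (a value of 0 would mean an unbounded queue):
puts "passenger_max_request_queue_size #{max_request_queue_size};"
```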
nllong commented 2 years ago

Interesting, yeah, I would guess that the 503 is due to an unresponsive web container. I like your approach in 1, but I think the real issue is 2. The problem with just implementing 1 is that the web container is likely to be slammed for tens of minutes if not longer, so a retry would quickly time out too. If you can figure out how to enforce the size of MAX_REQUESTS, then that should easily fix the problem.

brianlball commented 2 years ago

> If you can figure out how to enforce the size of MAX_REQUESTS, then that should easily fix the problem.

@nllong I think we can change the nginx.conf file with the appropriate values and then run some version of nginx -s reload.

brianlball commented 2 years ago

@nllong what do you think of these proposed changes to allow passenger_max_request_queue_size and passenger_max_pool_size to be set in an OSA, since not all problems are built the same, and to call nginx -s reload to reload without downtime?
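Roughly what I have in mind, as a sketch only (the directive values and the way they would come out of the OSA are assumptions; the paths match the log in the next comment):

```ruby
# Sketch: rewrite the Passenger directives in nginx.conf and reload nginx
# without downtime. The values are hard-coded here; they would come from the OSA.
conf_path = '/opt/nginx/conf/nginx.conf'
max_pool_size          = 16   # hypothetical value from the OSA
max_request_queue_size = 58   # hypothetical value from the OSA

conf = File.read(conf_path)
conf.gsub!(/passenger_max_pool_size\s+\d+;/,
           "passenger_max_pool_size #{max_pool_size};")
conf.gsub!(/passenger_max_request_queue_size\s+\d+;/,
           "passenger_max_request_queue_size #{max_request_queue_size};")
File.write(conf_path, conf)

# Test the new config first, then signal the master process to reload its workers.
test_config = `/opt/nginx/sbin/nginx -t 2>&1`
if test_config.include?('syntax is ok') && test_config.include?('test is successful')
  system('/opt/nginx/sbin/nginx -s reload')
else
  puts "nginx.conf test failed, not reloading:\n#{test_config}"
end
```

The log below shows this kind of sequence working: the config test passes, the old worker processes shut down, and new workers come up while the master process stays running.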

brianlball commented 2 years ago

[20:08:54.810129 INFO] whoami: nginx
[20:08:54.810146 INFO] test the nginx.conf file
[20:08:54.817204 INFO] test_config: nginx: the configuration file /opt/nginx/conf/nginx.conf syntax is ok nginx: configuration file /opt/nginx/conf/nginx.conf test is successful
[20:08:54.817237 INFO] test_config.include?('syntax is ok'): true
[20:08:54.817249 INFO] test_config.include?('test is successful'): true
[20:08:54.817256 INFO] get nginx processes
[20:08:54.818898 INFO] nginx_pids:
root     64  0.0  0.0  46612   9392 ?  S   20:08  0:00 nginx: master process /opt/nginx/sbin/nginx
nginx    78  0.0  0.0  47048   4560 ?  S   20:08  0:00 nginx: worker process
nginx    79  0.0  0.0  47048   4560 ?  S   20:08  0:00 nginx: worker process
nginx    80  0.0  0.0  47048   3408 ?  S   20:08  0:00 nginx: worker process
nginx   110 66.5  1.4 786180 234440 ?  Sl  20:08  0:02 Passenger AppPreloader: /opt/openstudio/server
nginx   157 47.0  1.2 786080 206296 ?  Sl  20:08  0:00 Passenger AppPreloader: /opt/openstudio/server (forking...)
nginx   177  0.0  0.0   4636    836 ?  S   20:08  0:00 sh -c ps aux|grep nginx
nginx   178  0.0  0.0  34412   2924 ?  R   20:08  0:00 ps aux
nginx   179  0.0  0.0  11468   1072 ?  S   20:08  0:00 grep nginx
[20:08:54.818950 INFO] nginx_worker_pids: ["78", "79", "80"]
[20:08:54.818961 INFO] reload the nginx.conf
[20:08:59.823727 INFO] count: 1
[20:08:59.823815 INFO] get nginx processes
[20:08:59.826758 INFO] nginx_pids:
root     64  0.0  0.0  46868   9648 ?  S   20:08  0:00 nginx: master process /opt/nginx/sbin/nginx
nginx    78  0.0  0.0  47048   4560 ?  S   20:08  0:00 nginx: worker process is shutting down
nginx   110 29.5  1.4 786180 234468 ?  Sl  20:08  0:02 Passenger AppPreloader: /opt/openstudio/server
nginx   157 13.4  1.2 786080 206272 ?  Sl  20:08  0:00 Passenger AppPreloader: /opt/openstudio/server (forking...)
nginx   199  0.0  0.0  47112   3744 ?  S   20:08  0:00 nginx: worker process
nginx   204  0.0  0.0  47112   3744 ?  S   20:08  0:00 nginx: worker process
nginx   206  0.0  0.0  47112   3744 ?  S   20:08  0:00 nginx: worker process
nginx   212  0.0  0.0   4636    880 ?  S   20:08  0:00 sh -c ps aux|grep nginx
nginx   213  0.0  0.0  34412   2980 ?  R   20:08  0:00 ps aux
nginx   214  0.0  0.0  11468   1040 ?  S   20:08  0:00 grep nginx
[20:08:59.826823 INFO] nginx_worker_pids2: ["199", "204", "206"]
[20:08:59.826861 INFO] reload nginx.conf success

tijcolem commented 2 years ago

@brianlball I think the plan was not to dynamically change the nginx state but rather just to configure nginx to handle the expected peak load that Rserve needs? If so, can we close this and open a new issue for that?

tijcolem commented 2 years ago

@brianlball Should we close this as the plan is to use a fixed resource spec?

nllong commented 1 year ago

I think that is the plan. If we have enough resources, then this shouldn't be an issue.

brianlball commented 1 year ago

API retries were implemented in https://github.com/NREL/OpenStudio-server/pull/682. Closing this as we don't want to change NGINX.