OpenNebula / addon-context-linux

Linux VM Contextualization
Apache License 2.0

report-ready script stuck (need to implement timeout in wget, curl and ruby) #237

Closed Th0masL closed 3 years ago

Th0masL commented 3 years ago

Hi,

I've just realized that the net-99-report-ready script does not set a timeout on its wget and/or curl commands, which makes the script hang for quite some time when there is a network connectivity issue with the OneGate endpoint.

Currently the script falls back to the default timeout of each tool: 900 seconds (15 minutes) for wget, or about 2 minutes for curl (depending on the operating system).

We should probably add an explicit timeout to the wget and curl commands in the net-99-report-ready script.

It also looks like the Ruby script that reports the ready state gets stuck:

root@server:/home/thomas# service one-context status
● one-context.service - OpenNebula contextualization script
     Loaded: loaded (/lib/systemd/system/one-context.service; enabled; vendor preset: enabled)
     Active: activating (start) since Wed 2021-07-14 10:18:47 UTC; 45s ago
   Main PID: 1398 (bash)
      Tasks: 6 (limit: 9448)
     Memory: 43.9M
     CGroup: /system.slice/one-context.service
             ├─1398 bash /usr/sbin/one-contextd network
             ├─1515 bash /usr/sbin/one-contextd network
             ├─1516 bash /etc/one-context.d/net-99-report-ready # waiting for the child process below
             ├─1594 bash /usr/bin/onegate vm update --data READY=YES # waiting for the child process below
             └─1597 ruby /usr/bin/onegate.rb vm update --data READY=YES # <------- stuck

I have updated the net-99-report-ready script to add timeouts to the curl and wget commands. That prevents the script from getting stuck on the wget or curl step, but it still gets stuck on the Ruby step.
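One way to bound the Ruby step from the outside, without touching onegate.rb, is to wrap the call in the `timeout` utility. This is only a sketch: it assumes GNU coreutils `timeout` is available in the VM (it exits with 124 when the limit is hit), and the helper name `run_with_timeout` and the 10-second limit are hypothetical, not part of the contextualization package.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND and kills it if it exceeds SECONDS.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_timeout() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    # Pass the command's own status (or 124) back to the caller
    return "$rc"
}

# Example call (limit of 10 seconds is an arbitrary choice):
#   run_with_timeout 10 onegate vm update --data READY=YES
```

The wrapper leaves the exit status untouched, so the caller can still tell a timeout (124) apart from an ordinary failure.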

Parameter for curl: --max-time 5
Parameter for wget: --timeout=5
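For illustration, the two parameters could be applied roughly as follows. This is a simplified sketch, not the actual net-99-report-ready code: the function names, the `ONEGATE_URL` and `TIMEOUT` variables and the fallback address are hypothetical, the OneGate authentication headers are omitted, and the wget variant assumes a version recent enough to support `--method`. The `--tries=1` option is my addition, since wget otherwise retries up to 20 times by default.

```shell
#!/bin/sh
# Hypothetical values; the real script derives the endpoint from the VM context.
ONEGATE_URL="${ONEGATE_ENDPOINT:-http://172.16.100.1:5030}/vm"
TIMEOUT=5

report_ready_curl() {
    # --max-time caps the whole transfer, connection phase included
    curl --max-time "$TIMEOUT" -X PUT "$ONEGATE_URL" --data "READY=YES"
}

report_ready_wget() {
    # --timeout sets the DNS, connect and read timeouts in one go;
    # --tries=1 (an assumption, not from the issue) disables retries
    wget --timeout="$TIMEOUT" --tries=1 -O - \
        --method=PUT --body-data="READY=YES" "$ONEGATE_URL"
}
```

With both limits in place, an unreachable endpoint costs a few seconds per tool instead of minutes.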

Here is the log output with the timeout parameters set for curl and wget (as you can see, both give up quite quickly):

Jul 14 09:52:33 server one-contextd[1513]: Script net-99-report-ready: Starting ...
Jul 14 09:53:42 server one-contextd[1643]: Script net-99-report-ready output:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Jul 14 09:53:42 server one-contextd[1643]:                                  Dload  Upload   Total   Spent    Left  Speed
Jul 14 09:53:42 server one-contextd[1643]: #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0#015  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0#015  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0#015  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0#015  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
Jul 14 09:53:42 server one-contextd[1643]: curl: (28) Connection timed out after 5002 milliseconds
Jul 14 09:53:42 server one-contextd[1643]: --2021-07-14 09:52:38--  http://172.16.100.1:5030/vm
Jul 14 09:53:42 server one-contextd[1643]: Connecting to 172.16.100.1:5030... failed: Connection timed out.
Jul 14 09:53:42 server one-contextd[1643]: Giving up.
Jul 14 09:53:42 server one-contextd[1643]: ERROR:
Jul 14 09:53:42 server one-contextd[1643]: Error timeout while connected to server (execution expired).
Jul 14 09:53:42 server one-contextd[1643]: Server: 172.16.100.1:5030
Jul 14 09:53:42 server one-contextd[1644]: Script net-99-report-ready: Finished with exit code 0

Also, I believe the current script always returns an exit code of 0, so maybe we should update the logic to return a non-zero exit code in case of errors?
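A possible shape for that logic, as a sketch only: try each reporting method in turn, succeed on the first one that works, and return non-zero if none do. The function name `report_ready` and the idea of passing the methods as arguments are hypothetical and not how the shipped script is organized.

```shell
#!/bin/sh
# report_ready METHOD...
# Hypothetical helper: each argument is the name of a command or shell
# function that reports READY=YES and returns 0 on success.
report_ready() {
    for method in "$@"; do
        if "$method"; then
            # First successful method wins
            return 0
        fi
    done
    echo "all report methods failed" >&2
    return 1
}

# Example call, with hypothetical method names:
#   report_ready report_ready_onegate report_ready_curl report_ready_wget || exit 1
```

one-contextd would then log a non-zero "Finished with exit code" when the ready state could not be reported by any method.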

Thomas

vholer commented 3 years ago

Thank you for the report; timeouts were added. I also checked that the exit code reflects the state of the operation.