Open kvaps opened 5 years ago
A similar problem was described here: https://github.com/OpenNebula/one/issues/1702
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. The OpenNebula Dev Team
This issue has been automatically closed due to lack of activity/feedback. Please reopen if you have further input or need to bump this. The OpenNebula Dev Team
BTW, I've just started using the linstor_un driver instead of LVM, and everything started working as it should. Linstor has its own timeouts for LVM operations.
This is a sound suggestion; reopening it.
@paczerny Verify this is still a problem with the new monitoring system
The issue is still there. Looking into the linstor driver, it uses a nice command, timeout -10 monitor..., and I suggest using the same approach.
The OpenNebula TM drivers use the methods monitor_and_log and ssh_monitor_and_log. We should create duplicates of these methods that take a timeout parameter and prepend the timeout command, and do the same for the ssh_ version.
The drivers can then each set an individual timeout.
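The proposal above could look roughly like the following. This is only a sketch under assumptions: monitor_with_timeout is a hypothetical name, its arguments are invented for illustration, and it does not reproduce the real monitor_and_log from the OpenNebula scripts. It relies on the GNU coreutils timeout command, which exits with status 124 when the time limit is reached.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OpenNebula): run a command string with a
# time limit so a stuck command cannot block the caller forever.
#   $1 = time limit in seconds
#   $2 = command string to run
#   $3 = short description used in the error message
monitor_with_timeout() {
    local limit="$1" cmd="$2" msg="$3" out rc

    # GNU coreutils `timeout` sends TERM to the command once $limit seconds
    # have passed and then exits with status 124.
    out=$(timeout "$limit" bash -c "$cmd" 2>&1)
    rc=$?

    if [ "$rc" -eq 124 ]; then
        echo "ERROR: $msg: timed out after ${limit}s" >&2
        return 124
    elif [ "$rc" -ne 0 ]; then
        echo "ERROR: $msg: $out" >&2
        return "$rc"
    fi

    # Success: hand the command output back to the caller.
    echo "$out"
}
```

A driver could then call, for example, monitor_with_timeout 10 "vgs" "LVM monitor" and treat a 124 return code as a monitoring error. For an ssh_ variant the same idea would apply with timeout placed in front of the ssh client on the front-end, which may matter because a remote process blocked in uninterruptible I/O may not react to signals at all.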
For LVM, this could be solved as part of the LVM refactor, issue #5911.
Side note: the methods have some log_error calls, but the errors don't appear in oned.log. This may be caused by the redirection of stderr to stdout.
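To illustrate how such a redirection can swallow error messages, here is a generic bash sketch. It is not the actual OpenNebula code: log_error_demo and probe are hypothetical stand-ins, and whether this is really what happens in the drivers is only the guess made above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a logging helper that writes to stderr.
log_error_demo() { echo "ERROR: $1" >&2; }

# Hypothetical monitor step: logs an error, then prints its normal output.
probe() {
    log_error_demo "monitor failed"
    echo "data"
}

# With 2>&1 inside the command substitution, the error line is captured into
# the variable together with stdout, so it never reaches the caller's stderr.
captured=$(probe 2>&1)

# With stderr sent elsewhere, only the normal output is captured.
separate=$(probe 2>/dev/null)
```

If the caller only reads stderr for log lines, the message stored in captured would be lost unless it is printed again explicitly.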
Description
When a node has stuck operations, it can break OpenNebula itself. The cause may be a broken disk subsystem, a disconnected target, or some other problem. OpenNebula runs a lot of /var/lib/one/remotes/tm/<driver>/monitor operations, but they are stuck forever.
To Reproduce
For example, right now I have a broken LUN and any lvm command is stuck for ages. Try to reproduce that:
Now you have a broken host, and any lvm command will be stuck forever. Wait for a while, then check ps aux on the OpenNebula host; you will see a lot of hung monitor commands.
Expected behavior
OpenNebula will return ERROR for the monitoring of this host and continue monitoring the rest of the hosts.
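The ps aux check from the reproduction steps can be narrowed down to long-running monitor scripts. The following is a sketch only: list_stuck_monitors is a hypothetical helper, the 300 second threshold in the usage line is an arbitrary assumption, and the etimes column (elapsed seconds) assumes the procps-ng version of ps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lines of "pid etimes stat args" on stdin and print
# pid, age in seconds and state for TM monitor scripts older than $1 seconds.
list_stuck_monitors() {
    awk -v min="$1" '
        # $2 is the elapsed time in seconds; match the TM monitor script path.
        $2 >= min && $0 ~ "remotes/tm/[^/]+/monitor" { print $1, $2, $3 }
    '
}

# Usage on a live system (assumes procps-ng ps):
#   ps -eo pid=,etimes=,stat=,args= | list_stuck_monitors 300
```

Processes listed with a state starting with D are in uninterruptible sleep, which is typical for commands blocked on broken storage.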