OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

Timeout for monitor operation #2730

Open kvaps opened 5 years ago

kvaps commented 5 years ago

Description When a node has stuck operations, it can break OpenNebula itself, e.g. because of a broken disk subsystem, a disconnected target, or some other problem. OpenNebula runs a lot of /var/lib/one/remotes/tm/<driver>/monitor operations, but they stay stuck forever.

To Reproduce E.g. right now I have a broken LUN and any lvm command is stuck for ages. Try to reproduce that:

Now you have a broken host, and any lvm command will be stuck forever. Wait for a while, then check ps aux on the OpenNebula host; you will see a lot of hung monitor commands.

Expected behavior OpenNebula returns ERROR for the monitoring of this host and continues monitoring the rest of the hosts.


kvaps commented 5 years ago

A similar problem was described here: https://github.com/OpenNebula/one/issues/1702

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. The OpenNebula Dev Team

stale[bot] commented 4 years ago

This issue has been automatically closed due to lack of activity/feedback. Please reopen if you have further input or need to bump this. The OpenNebula Dev Team

kvaps commented 4 years ago

BTW, I've just started using the linstor_un driver instead of LVM, and everything started working as it should. Linstor has its own timeouts for LVM operations.

tinova commented 4 years ago

this is a sound suggestion, reopening it

rsmontero commented 2 years ago

@paczerny Verify this is still a problem with the new monitoring system

paczerny commented 1 year ago

The issue is still there. Looking into the linstor driver, it uses a nice command, timeout -10 monitor..., and I suggest using the same approach.

The OpenNebula TM drivers use the methods monitor_and_log and ssh_monitor_and_log. We should create duplicates of these methods with a timeout parameter and prepend the timeout command, and do the same for the ssh_ version. The drivers can then set individual timeouts.
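
A minimal sketch of that idea, assuming GNU coreutils timeout is available where the monitor runs. The name monitor_with_timeout is hypothetical and is not an existing OpenNebula function; the real monitor_and_log may take different arguments and log differently, so treat this only as an illustration of prepending timeout and mapping its exit status to an error.

```shell
#!/bin/bash
# Hypothetical helper (not part of OpenNebula): run a monitor command
# with a time limit and report a timeout as an error.
# Usage: monitor_with_timeout <seconds> <command> [args...]
monitor_with_timeout() {
    local limit="$1"
    shift
    local output rc

    # GNU timeout stops the command after $limit seconds and, in that
    # case, exits with status 124.
    output=$(timeout "$limit" "$@" 2>&1)
    rc=$?

    if [ "$rc" -eq 124 ]; then
        echo "ERROR: monitor timed out after ${limit}s: $*" >&2
        return 1
    elif [ "$rc" -ne 0 ]; then
        echo "ERROR: monitor failed (exit $rc): $output" >&2
        return "$rc"
    fi

    # Success: pass the monitor output through unchanged.
    echo "$output"
}
```

A driver could then call something like monitor_with_timeout 10 <its monitor command>, choosing its own limit. For an ssh_ variant, one option would be to wrap the local ssh invocation the same way; that would unblock the caller, although a process already stuck on the remote host might still remain there.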

For LVM, it could be solved by the LVM refactor, issue #5911.

Side note: The methods have some log_error calls, but the errors don't appear in oned.log; this may be caused by the redirection of stderr to stdout.