How to solve the service process stuck in the "running" status

victor051 commented 5 years ago

The service process's status is stuck in "running" for some unknown reason. Is there a way to stop it? Is it possible to add a "stop" button to kill it?

afourmy commented 5 years ago

In the "Admin / Advanced" page, there is a "Reset Service / Workflow statuses." button: just click on it and all statuses will be reset. But if you end up in that situation, it means something went wrong, and you should check the logs for an exception. Can you do that and paste it here? One reason I've seen this happen is when you're running a service with multiprocessing enabled but your database (SQLite) does not support concurrency, so you get a "database is locked" exception. If that is your case, there are two solutions:

Use SQLite in WAL journal mode, so that it supports concurrency
Use a PostgreSQL database
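
For the first option, a minimal sketch using Python's standard `sqlite3` module is shown below. The function name `enable_wal` and the file name `database.db` are illustrative assumptions, not eNMS code or its actual database path. WAL mode is stored in the database file, so it only needs to be set once. Note that WAL lets readers and a writer proceed concurrently, but SQLite still allows only one writer at a time.

```python
import os
import sqlite3
import tempfile


def enable_wal(db_path):
    """Switch an SQLite database file to WAL journal mode.

    Returns the journal mode that is actually in effect afterwards.
    """
    conn = sqlite3.connect(db_path)
    try:
        # The PRAGMA returns one row holding the resulting journal mode.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        # Optional: wait up to 5 seconds for a lock instead of failing
        # immediately with "database is locked".
        conn.execute("PRAGMA busy_timeout = 5000")
    finally:
        conn.close()
    return mode


if __name__ == "__main__":
    # Hypothetical path, used only for this demonstration.
    path = os.path.join(tempfile.mkdtemp(), "database.db")
    print(enable_wal(path))
```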

victor051 commented 5 years ago

I reproduced this problem and got the log. I found that one of the devices used the wrong driver, which caused an error. But I think the process should not get stuck.

=============================== logs ===============================

04-12-2019 14:20:18 ERROR Job "scheduler_job (trigger: date[2019-04-12 14:19:45 CST], next run at: 2019-04-12 14:19:45 CST)" raised an exception
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 752, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 6924 waits for ShareLock on transaction 2701; blocked by process 6923.
Process 6923 waits for ShareLock on transaction 2703; blocked by process 6929.
Process 6929 waits for ExclusiveLock on tuple (1,17) of relation 16497 of database 16384; blocked by process 6924.
HINT: See server log for query details.
CONTEXT: while updating tuple (1,17) in relation "Device"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.6/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/home/nemo/eNMS/eNMS/automation/functions.py", line 28, in scheduler_job
    results, now = job.try_run(targets=targets, payload=payload)
  File "/home/nemo/eNMS/eNMS/automation/models.py", line 175, in try_run
    attempt = self.run(payload, job_from_workflow_targets, targets, workflow)
  File "/home/nemo/eNMS/eNMS/automation/models.py", line 270, in run
    [(device, results, payload, workflow) for device in targets],
  File "/usr/local/python3/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/python3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/local/python3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/python3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/nemo/eNMS/eNMS/automation/models.py", line 249, in device_run
    device_result = self.get_results(payload, device, workflow)
  File "/home/nemo/eNMS/eNMS/automation/models.py", line 244, in get_results
    return results
  File "/usr/local/python3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/nemo/eNMS/eNMS/functions.py", line 206, in session_scope
    raise e
  File "/home/nemo/eNMS/eNMS/functions.py", line 202, in session_scope
    session.commit()
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
    self.transaction.commit()
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 493, in commit
    self._prepare_impl()
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
    self.session.flush()
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2451, in flush
    self._flush(objects)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2589, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/session.py", line 2549, in _flush
    flush_context.execute()
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 236, in save_obj
    update,
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 978, in _emit_update_statements
    statement, multiparams
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/python3/lib/python3.6/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 752, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL: Process 6924 waits for ShareLock on transaction 2701; blocked by process 6923.
Process 6923 waits for ShareLock on transaction 2703; blocked by process 6929.
Process 6929 waits for ExclusiveLock on tuple (1,17) of relation 16497 of database 16384; blocked by process 6924.
HINT: See server log for query details.
CONTEXT: while updating tuple (1,17) in relation "Device"

[SQL: UPDATE "Device" SET last_runtime=%(last_runtime)s WHERE "Device".id = %(Device_id)s] [parameters: ({'last_runtime': 16.203381, 'Device_id': 141}, {'last_runtime': 16.100812, 'Device_id': 142}, {'last_runtime': 16.167533, 'Device_id': 143}, {'last_runtime': 16.269567, 'Device_id': 144}, {'last_runtime': 16.398398, 'Device_id': 145}, {'last_runtime': 16.662657, 'Device_id': 146}, {'last_runtime': 16.210814, 'Device_id': 147}, {'last_runtime': 16.141901, 'Device_id': 148}, {'last_runtime': 16.15689, 'Device_id': 149}, {'last_runtime': 16.127528, 'Device_id': 150})] (Background on this error at: http://sqlalche.me/e/e3q8)
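
The parameters in the log show a single flush updating ten "Device" rows. One plausible reading (an assumption, not something confirmed in this thread) is that several worker processes issue such multi-row updates in different row orders, so each ends up waiting on a row lock held by another. Two common mitigations are sketched below: updating rows in a consistent order, and retrying the whole transaction when the database aborts it. `DeadlockError`, `ordered_updates` and `run_with_retry` are hypothetical names for illustration and are not part of eNMS, SQLAlchemy or psycopg2. In real code the `except` clause would catch the driver's own exception, and `operation` would have to redo the entire transaction.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the database driver's deadlock exception."""


def ordered_updates(params):
    """Sort per-row update parameters by primary key.

    If every worker updates rows in the same order, two workers cannot
    each hold a lock that the other one is waiting for.
    """
    return sorted(params, key=lambda row: row["Device_id"])


def run_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on DeadlockError, back off and try again.

    The exception is re-raised once `attempts` tries have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DeadlockError:
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    rows = [
        {"last_runtime": 16.167533, "Device_id": 143},
        {"last_runtime": 16.203381, "Device_id": 141},
    ]
    print([row["Device_id"] for row in ordered_updates(rows)])
```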

afourmy commented 5 years ago

What service are you using? On what devices?
How many targets? How many processes?

It depends a lot on what you're doing and on your environment. I can't fix it if I cannot reproduce it.

victor051 commented 5 years ago

What service are you using? On what devices?
NetmikoBackupService on H3C switches (netmiko driver: hp_comware). The error occurred when one of the switches was set to hp_procurve.

How many targets? How many processes?
12 switches and 50 processes.

afourmy commented 5 years ago

I can't reproduce it, but if you have only 12 switches, you don't need to enable multiprocessing.
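
As a rough illustration of why 50 processes are unnecessary for 12 targets, the sketch below caps the number of workers at the number of targets. `effective_processes` is a hypothetical helper, not an eNMS setting or function. A result of 1 would mean running the targets sequentially, without multiprocessing.

```python
def effective_processes(num_targets, configured_processes):
    """Return a sensible worker count for a run.

    There is no benefit in starting more workers than there are targets,
    and at least one worker is always needed.
    """
    if num_targets < 1 or configured_processes < 1:
        return 1
    return min(num_targets, configured_processes)


if __name__ == "__main__":
    # The numbers reported in this thread: 12 switches, 50 processes.
    print(effective_processes(12, 50))
```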

victor051 commented 5 years ago

Got it, I will try to solve this bug. Thank you.

afourmy commented 5 years ago

This can no longer happen in eNMS 3.15: services and workflows don't have a "status" anymore.

eNMS-automation / eNMS

How to solve the service process stuck in the "running" status #134