Open arjunrajlab opened 1 year ago
@bruyeret Someone in lab was asking about this again. I just played with it and I can't get the "X compute" button, which should cancel a running job, to show up.
Yes, this issue has not been solved.
I made a branch, cancel-workers, 6 months ago, but I ran into an issue that I discussed with David.
I just rebased this branch on master (it was 154 commits behind).
We can resume our discussion here, @manthey.
Here is what I had last time:
You can check out this branch, open a dataset, and create an annotation worker, for example the random square one.
If I open the worker and choose to create 1000 annotations, it works perfectly fine: the worker creates the annotations and the front end downloads the new annotations once the worker is done.
If I click cancel during the computation, there are two issues: the worker keeps going and computes all the annotations, and I get a 500 error from Girder even though Girder says that the job is cancelled.
The output of the worker is the following:
Executed the code in: 6.64365798100016 seconds
Invalid state transition to '3', Current state is '824'.
State 3 is success and 824 is cancelling, so the worker is trying to mark the job as successful after a cancel was already requested.
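To illustrate the message above, a job status update can be thought of as a lookup in a table of allowed previous states. The following is a minimal sketch, not Girder's actual implementation: only the codes 3 (success) and 824 (cancelling) come from the log, while the other status codes, the `VALID_PREVIOUS` table, and the function name `check_transition` are assumptions made for illustration.

```python
# Hypothetical sketch of a job state-transition check (not Girder's real code).
SUCCESS = 3        # from the worker log above
CANCELING = 824    # from the worker log above
RUNNING = 2        # assumed value, for illustration only
CANCELED = 5       # assumed value, for illustration only

# For each target state, the set of states it may be entered from (assumed).
VALID_PREVIOUS = {
    SUCCESS: {RUNNING},
    CANCELING: {RUNNING},
    CANCELED: {RUNNING, CANCELING},
}


def check_transition(current: int, new: int) -> None:
    """Raise ValueError if moving from `current` to `new` is not allowed."""
    if current not in VALID_PREVIOUS.get(new, set()):
        raise ValueError(
            f"Invalid state transition to '{new}', Current state is '{current}'."
        )
```

Under these assumptions, a worker that was never interrupted and reports success after a cancel request produces exactly this kind of rejected transition.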
In the browser, I get an error from Girder for the PUT request on the endpoint /job/${jobId}/cancel:
[2024-04-11 10:14:27,239] ERROR: 500 Error
Traceback (most recent call last):
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 472, in _reraise_as_library_errors
yield
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 459, in _ensure_connection
return retry_over_time(
^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/kombu/utils/functional.py", line 318, in retry_over_time
return fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 934, in _connection_factory
self._connection = self._establish_connection()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 860, in _establish_connection
conn = self.transport.establish_connection()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/kombu/transport/pyamqp.py", line 203, in establish_connection
conn.connect()
File "/venv/lib/python3.11/site-packages/amqp/connection.py", line 324, in connect
self.transport.connect()
File "/venv/lib/python3.11/site-packages/amqp/transport.py", line 129, in connect
self._connect(self.host, self.port, self.connect_timeout)
File "/venv/lib/python3.11/site-packages/amqp/transport.py", line 184, in _connect
self.sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 655, in endpointDecorator
val = fun(self, path, params)
^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 1251, in PUT
return self.handleRoute('PUT', path, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 983, in handleRoute
val = handler(**kwargs)
^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder/api/access.py", line 56, in wrapped
return fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 436, in wrapped
val = fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder/api/describe.py", line 736, in wrapped
return fun(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder_jobs/job_rest.py", line 203, in cancelJob
return self._model.cancelJob(job)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder_jobs/models/job.py", line 145, in cancelJob
event = events.trigger('jobs.cancel', info=job)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/girder/events.py", line 291, in trigger
handler(e)
File "/venv/lib/python3.11/site-packages/girder_worker/girder_plugin/event_handlers.py", line 141, in cancel
asyncResult.revoke()
File "/venv/lib/python3.11/site-packages/celery/result.py", line 160, in revoke
self.app.control.revoke(self.id, connection=connection,
File "/venv/lib/python3.11/site-packages/celery/app/control.py", line 496, in revoke
return self.broadcast('revoke', destination=destination, arguments={
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/celery/app/control.py", line 776, in broadcast
return self.mailbox(conn)._broadcast(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/kombu/pidbox.py", line 330, in _broadcast
chan = channel or self.connection.default_channel
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 953, in default_channel
self._ensure_connection(**conn_opts)
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 458, in _ensure_connection
with ctx():
File "/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 158, in __exit__
self.gen.throw(typ, value, traceback)
File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 476, in _reraise_as_library_errors
raise ConnectionError(str(exc)) from exc
kombu.exceptions.OperationalError: [Errno 111] Connection refused
Additional info:
Request URL: PUT http://localhost:8080/api/v1/job/6617b7fcc2e0ea61cecb39f5/cancel
Query string:
Remote IP: 172.17.0.1
Request UID: 9249b22f-3c15-4605-a5a2-e247b74f0e3a
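The traceback ends in `ConnectionRefusedError: [Errno 111] Connection refused`, raised while Celery tries to broadcast the revoke over the AMQP transport. One possible reading is that, from inside the Girder process, nothing accepts connections at the broker address it is configured with. A minimal sketch of a reachability check using only the standard library is below. The function name `broker_reachable` is hypothetical, and the host and port to pass in are assumptions that must be replaced with the values of the actual deployment.

```python
import socket


def broker_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (Errno 111) and timeouts are OSError subclasses.
        return False
```

For example, running `broker_reachable("localhost", 5672)` inside the Girder container (5672 being the default AMQP port, and `localhost` an assumed host) would show whether the broker is visible from there. Inside a container, `localhost` refers to the container itself, so a broker running in another container would not be found at that address.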
@manthey I also tried this again just now. I see the same error in the Girder logs:
[2024-04-23 12:17:25,617: INFO/MainProcess] Received task: girder_worker.docker.tasks.docker_run[cb510362-e700-4e17-b006-272744c06867]
/usr/local/lib/python3.6/dist-packages/celery/platforms.py:801: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0
uid=uid, euid=euid, gid=gid, egid=egid,
[2024-04-23 12:17:30,459: WARNING/ForkPoolWorker-16] creating new log file
2024-04-23 12:17:30,452 [INFO] WRITING LOG OUTPUT TO /root/.cellpose/run.log
2024-04-23 12:17:30,452 [INFO]
cellpose version: 2.2.3
platform: linux
python version: 3.10.12
torch version: 2.1.0+cu121
[2024-04-23 12:17:30,466: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,465 [INFO] TORCH CUDA version not installed/working.
2024-04-23 12:17:30,465 [INFO] >>>> using CPU
[2024-04-23 12:17:30,470: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,466 [INFO] >> nuclei << model set to be used
[2024-04-23 12:17:30,580: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,580 [INFO] >>>> model diam_mean = 17.000 (ROIs rescaled to this size during training)
[2024-04-23 12:17:30,729: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,728 [INFO] ~~~ FINDING MASKS ~~~
[2024-04-23 12:17:34,042: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:34,036 [INFO] >>>> TOTAL TIME 3.31 sec
[2024-04-23 12:17:35,364: WARNING/ForkPoolWorker-16] Uploading 513 annotations
progress=1 title=Running Cellpose info=1/1
[2024-04-23 12:17:36,131: WARNING/ForkPoolWorker-16] Invalid state transition to '3', Current state is '824'.
[2024-04-23 12:17:36,140: INFO/ForkPoolWorker-16] Task girder_worker.docker.tasks.docker_run[cb510362-e700-4e17-b006-272744c06867] succeeded in 10.495001778006554s: None
Seems like the same "Invalid state transition to '3', Current state is '824'." issue. See also above from @bruyeret for more context on the PUT request. Not sure why this is not reproducing on your setup. I am doing this using localhost:5173 for the server and localhost:8080 for the girder domain, but we have noticed the error in a number of other setups as well.
@manthey I could also get this up on AWS if you want to give it a try there.
Update: @manthey has now been able to see the problem and is trying to get to the bottom of it.
I fixed the endpoints for uploading annotations that we suspected to be the cause of the issue. I made a worker upload 3000 annotations and sleep 2 seconds every 100 annotations (so that it takes a total of 60 s). It works as expected and uploads everything in 1 min. But when I try to cancel, I get a 500 error from Girder, and in the logs I see the same error as above. What do you think, @manthey?
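For reference, the test worker described above (3000 annotations, a 2 second pause every 100) can be sketched as a batched loop that also checks for cancellation between batches. This is only an illustration under stated assumptions, not the code on the branch: `upload_in_batches`, `upload`, and `is_cancelled` are hypothetical names, and how `is_cancelled` would learn about the cancel request is left open.

```python
import time
from typing import Callable, List


def upload_in_batches(
    annotations: List[dict],
    upload: Callable[[List[dict]], None],
    is_cancelled: Callable[[], bool],
    batch_size: int = 100,
    pause_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload annotations batch by batch, stopping early on cancellation.

    Returns the number of annotations that were uploaded.
    """
    uploaded = 0
    for start in range(0, len(annotations), batch_size):
        # Cooperative cancellation: check before starting each batch.
        if is_cancelled():
            break
        batch = annotations[start:start + batch_size]
        upload(batch)
        uploaded += len(batch)
        sleep(pause_s)  # simulates a slow computation, as in the test worker
    return uploaded
```

With the defaults, 3000 annotations give 30 batches and 60 seconds of sleeping in total, matching the timing reported above. A check like this only helps if the worker process is not killed outright and the cancel request actually reaches it.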
People in lab have noted that they want an option to stop a worker (annotation or property, but mostly annotation). They sometimes start a large job and realize it's doing the wrong thing and want to stop it, but there's no way to do that currently. Can we incorporate a stop button that would kill the process? I think a UI which just gives the option to stop right on the "Compute" button would be sufficient.