Kitware / UPennContrast

https://upenn-contrast.netlify.com/
Apache License 2.0

Stop current worker #411

Open arjunrajlab opened 1 year ago

arjunrajlab commented 1 year ago

People in the lab have noted that they want an option to stop a worker (annotation or property, but mostly annotation). They sometimes start a large job, realize it's doing the wrong thing, and want to stop it, but there's currently no way to do that. Can we incorporate a stop button that would kill the process? I think a UI that just gives the option to stop right on the "Compute" button would be sufficient.

arjunrajlab commented 5 months ago

@bruyeret Someone in the lab was asking about this again. I just played with it, and I can't get the "X compute" button for cancelling a job to show up.

bruyeret commented 5 months ago

Yes, this issue has not been solved. I made a branch, cancel-workers, 6 months ago, but I ran into a problem that I discussed with David. I just rebased this branch on master (it was 154 commits behind). We can resume our discussion here, @manthey.

Here is what I had last time:


You can check out this branch, open a dataset, and create an annotation worker, for example the random square one.

In the browser, I get an error from Girder for the PUT request to the endpoint /job/${jobId}/cancel:

[2024-04-11 10:14:27,239] ERROR: 500 Error
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 472, in _reraise_as_library_errors
    yield
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 459, in _ensure_connection
    return retry_over_time(
           ^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/utils/functional.py", line 318, in retry_over_time
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 934, in _connection_factory
    self._connection = self._establish_connection()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 860, in _establish_connection
    conn = self.transport.establish_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/transport/pyamqp.py", line 203, in establish_connection
    conn.connect()
  File "/venv/lib/python3.11/site-packages/amqp/connection.py", line 324, in connect
    self.transport.connect()
  File "/venv/lib/python3.11/site-packages/amqp/transport.py", line 129, in connect
    self._connect(self.host, self.port, self.connect_timeout)
  File "/venv/lib/python3.11/site-packages/amqp/transport.py", line 184, in _connect
    self.sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 655, in endpointDecorator
    val = fun(self, path, params)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 1251, in PUT
    return self.handleRoute('PUT', path, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 983, in handleRoute
    val = handler(**kwargs)
          ^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/access.py", line 56, in wrapped
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 436, in wrapped
    val = fun(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/describe.py", line 736, in wrapped
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_jobs/job_rest.py", line 203, in cancelJob
    return self._model.cancelJob(job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_jobs/models/job.py", line 145, in cancelJob
    event = events.trigger('jobs.cancel', info=job)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/events.py", line 291, in trigger
    handler(e)
  File "/venv/lib/python3.11/site-packages/girder_worker/girder_plugin/event_handlers.py", line 141, in cancel
    asyncResult.revoke()
  File "/venv/lib/python3.11/site-packages/celery/result.py", line 160, in revoke
    self.app.control.revoke(self.id, connection=connection,
  File "/venv/lib/python3.11/site-packages/celery/app/control.py", line 496, in revoke
    return self.broadcast('revoke', destination=destination, arguments={
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/celery/app/control.py", line 776, in broadcast
    return self.mailbox(conn)._broadcast(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/pidbox.py", line 330, in _broadcast
    chan = channel or self.connection.default_channel
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 953, in default_channel
    self._ensure_connection(**conn_opts)
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 458, in _ensure_connection
    with ctx():
  File "/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 476, in _reraise_as_library_errors
    raise ConnectionError(str(exc)) from exc
kombu.exceptions.OperationalError: [Errno 111] Connection refused
Additional info:
  Request URL: PUT http://localhost:8080/api/v1/job/6617b7fcc2e0ea61cecb39f5/cancel
  Query string: 
  Remote IP: 172.17.0.1
  Request UID: 9249b22f-3c15-4605-a5a2-e247b74f0e3a
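
The innermost exception in the traceback above is a ConnectionRefusedError raised while Celery's revoke() opens an AMQP connection, which suggests that the Girder process cannot reach the message broker at the address it is using for that call. A minimal sketch for checking plain TCP reachability from the Girder side follows. The helper name `broker_reachable` is hypothetical, and 5672 is only the conventional AMQP port; the real host and port depend on the broker URL of the deployment.

```python
import socket


def broker_reachable(host: str, port: int = 5672, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only tests TCP reachability; it does not authenticate against
    the broker or speak AMQP to it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # "localhost" is an assumption; use the host from the broker URL
    # that the Girder side is actually configured with.
    print(broker_reachable("localhost"))
```

A False result for the address used on the Girder side would be consistent with the "[Errno 111] Connection refused" seen in the traceback, but it does not by itself say which configuration value is wrong.
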

arjunrajlab commented 5 months ago

@manthey I also tried this again just now. I see the same error in the Girder logs:

[2024-04-23 12:17:25,617: INFO/MainProcess] Received task: girder_worker.docker.tasks.docker_run[cb510362-e700-4e17-b006-272744c06867]
/usr/local/lib/python3.6/dist-packages/celery/platforms.py:801: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0
  uid=uid, euid=euid, gid=gid, egid=egid,
[2024-04-23 12:17:30,459: WARNING/ForkPoolWorker-16] creating new log file
2024-04-23 12:17:30,452 [INFO] WRITING LOG OUTPUT TO /root/.cellpose/run.log
2024-04-23 12:17:30,452 [INFO]
cellpose version:   2.2.3
platform:           linux
python version:     3.10.12
torch version:      2.1.0+cu121
[2024-04-23 12:17:30,466: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,465 [INFO] TORCH CUDA version not installed/working.
2024-04-23 12:17:30,465 [INFO] >>>> using CPU
[2024-04-23 12:17:30,470: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,466 [INFO] >> nuclei << model set to be used
[2024-04-23 12:17:30,580: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,580 [INFO] >>>> model diam_mean =  17.000 (ROIs rescaled to this size during training)
[2024-04-23 12:17:30,729: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,728 [INFO] ~~~ FINDING MASKS ~~~
[2024-04-23 12:17:34,042: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:34,036 [INFO] >>>> TOTAL TIME 3.31 sec
[2024-04-23 12:17:35,364: WARNING/ForkPoolWorker-16] Uploading 513 annotations
progress=1 title=Running Cellpose info=1/1
[2024-04-23 12:17:36,131: WARNING/ForkPoolWorker-16] Invalid state transition to '3', Current state is '824'.
[2024-04-23 12:17:36,140: INFO/ForkPoolWorker-16] Task girder_worker.docker.tasks.docker_run[cb510362-e700-4e17-b006-272744c06867] succeeded in 10.495001778006554s: None

This seems to be the same "Invalid state transition to '3', Current state is '824'." issue. See also the comment above from @bruyeret for more context on the PUT request. I'm not sure why this is not reproducing on your setup. I am using localhost:5173 for the server and localhost:8080 for the Girder domain, but we have also seen the error in a number of other setups.
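
For reading that log line, here is a small sketch that maps the numeric job states to names. The code table is an assumption based on my understanding of the status constants in girder_jobs (0 to 5) and girder_worker (820 to 824), so it should be checked against the installed versions; `JOB_STATUS_NAMES` and `decode_transition` are hypothetical names used only for illustration.

```python
import re

# Status codes as I understand them from girder_jobs (0-5) and
# girder_worker (820-824); verify against the installed versions.
JOB_STATUS_NAMES = {
    0: "INACTIVE",
    1: "QUEUED",
    2: "RUNNING",
    3: "SUCCESS",
    4: "ERROR",
    5: "CANCELED",
    820: "FETCHING_INPUT",
    821: "CONVERTING_INPUT",
    822: "CONVERTING_OUTPUT",
    823: "PUSHING_OUTPUT",
    824: "CANCELING",
}

_TRANSITION_RE = re.compile(
    r"Invalid state transition to '(\d+)', Current state is '(\d+)'"
)


def decode_transition(log_line: str):
    """Return (target_name, current_name) for an invalid-transition line, or None."""
    match = _TRANSITION_RE.search(log_line)
    if match is None:
        return None
    target, current = (int(code) for code in match.groups())
    return (
        JOB_STATUS_NAMES.get(target, f"UNKNOWN({target})"),
        JOB_STATUS_NAMES.get(current, f"UNKNOWN({current})"),
    )
```

If the table is right, the line says the job was already marked as canceling when the task finished and tried to report success. That would be consistent with the cancel request being recorded in Girder while the revoke call never reached the worker.
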

arjunrajlab commented 5 months ago

@manthey I could also get this up on AWS if you want to give it a try there.

arjunrajlab commented 5 months ago

Update: @manthey has now been able to see the problem and is trying to get to the bottom of it.

bruyeret commented 4 months ago

I fixed the endpoints for uploading annotations that we suspected to be the cause of the issue. I made a worker upload 3000 annotations and sleep 2 seconds every 100 annotations (so that it takes a total of 60 s). It works as expected and uploads everything in 1 min. But when I try to cancel, I get a 500 error from Girder, and in the logs I see the same error as above. What do you think, @manthey?
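
For anyone who wants to reproduce this, a deliberately slow worker of the kind described above can be sketched as follows. This is not the code on the branch: `upload_in_batches` and `upload_batch` are hypothetical names, and `upload_batch` stands in for whatever call the worker uses to send annotations to the server.

```python
import time
from typing import Callable, Sequence


def upload_in_batches(
    annotations: Sequence[dict],
    upload_batch: Callable[[Sequence[dict]], None],
    batch_size: int = 100,
    pause_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload annotations batch by batch, pausing after each batch.

    Returns the number of batches uploaded.
    """
    batches = 0
    for start in range(0, len(annotations), batch_size):
        upload_batch(annotations[start:start + batch_size])
        sleep(pause_seconds)  # keeps the job alive long enough to try cancelling
        batches += 1
    return batches
```

With 3000 annotations and the defaults, this makes 3000 / 100 = 30 uploads with a 2 s pause after each, so about 60 s of sleeping in total, which leaves enough time to click cancel while the job is still running.
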