galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Pulsar queued_python potentially isn't executing jobs on my machine. #375

Open hexylena opened 1 month ago

hexylena commented 1 month ago

Reported in the admins Matrix channel.

app.yml

```
---
dependency_resolution:
  resolvers:
    - auto_init: true
      auto_install: true
      type: conda
job_metrics_config_file: job_metrics_conf.yml
min_polling_interval: 0.5
persistence_directory: /mnt/pulsar/files/persisted_data
private_token: asdf
staging_directory: /mnt/pulsar/files/staging
tool_dependency_dir: /mnt/pulsar/deps
managers:
  _default_:
    type: queued_python
    num_concurrent_jobs: 1
```

but my jobs don't execute:

Oct 09 16:51:03 worker1 uwsgi[66861]: 2024-10-09 16:51:03,445 DEBUG [pulsar.managers.base][uWSGIWorker1Core0] job_id: 16 - checking tool file cutWrapper.pl
Oct 09 16:51:03 worker1 uwsgi[66861]: 2024-10-09 16:51:03,446 DEBUG [galaxy.tool_util.deps][uWSGIWorker1Core0] Using dependency perl version 5.26 of type conda 
Oct 09 16:51:03 worker1 uwsgi[66861]: [pid: 66861|app: 0|req: 610/610] 145.38.195.22 () {32 vars in 5437 bytes} [Wed Oct  9 16:51:03 2024] POST /managers/_default_/jobs/16/submit?command_line=%2Fbin%2Fbash+%2Fm..........
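For anyone reading the access log line above, here is a minimal sketch of decoding the submit request to see which manager and job it targets. The helper name `decode_submit_request` is hypothetical, and the example string stops where the log itself is truncated, so only the visible part of `command_line` is decoded.

```python
from urllib.parse import parse_qs, urlsplit


def decode_submit_request(request_target):
    """Split a request target like the one in the uwsgi log into its parts.

    Assumes the path layout seen in the log:
    /managers/<manager>/jobs/<job_id>/submit
    """
    parts = urlsplit(request_target)
    segments = parts.path.strip("/").split("/")
    manager, job_id = segments[1], segments[3]
    # parse_qs percent-decodes values and turns '+' into a space
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return manager, job_id, params


if __name__ == "__main__":
    # Only the part of the query string visible in the log; the rest is elided there.
    target = "/managers/_default_/jobs/16/submit?command_line=%2Fbin%2Fbash+%2Fm"
    print(decode_submit_request(target))
```

This only confirms that the request reached the `_default_` manager for job 16; it says nothing about whether the job was later picked up.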

Nate suggested running py-spy:

root@worker1:/mnt/pulsar# py-spy dump --pid 66861
Process 66861: /mnt/pulsar/venv/bin/uwsgi --ini-paste /mnt/pulsar/config/server.ini
Python v3.10.12 (/mnt/pulsar/venv/bin/uwsgi)

Thread 0x7F93FB7D1040 (active): "uWSGIWorker1Core0"
Thread 0x7F93F35FF640 (idle): "Thread-1 (run_next)"
    wait (threading.py:320)
    get (queue.py:171)
    run_next (pulsar/managers/queued.py:83)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

I can verify that jobs get added to the queue, but nothing seems to be read from it.
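To make the dump easier to read: the `wait` / `get` / `run_next` frames are what a consumer thread looks like while it is blocked in `queue.Queue.get()` on a queue it sees as empty. Below is a minimal standalone sketch of that pattern. It is not Pulsar's code; `run_next` only mirrors the frame name from the dump, and `work_queue`, `results`, and `demo` are hypothetical names.

```python
import queue
import threading


def run_next(work_queue, results):
    """Consume items until a None sentinel arrives."""
    while True:
        # Blocks here while the queue is empty; py-spy would report this
        # thread as idle in queue.get() / threading wait().
        item = work_queue.get()
        if item is None:
            break
        results.append(item)


def demo():
    work_queue = queue.Queue()
    results = []
    worker = threading.Thread(
        target=run_next, args=(work_queue, results), name="run_next-demo", daemon=True
    )
    worker.start()
    # Producer side: items put on the SAME queue object wake the consumer.
    for job_id in ("16", "17"):
        work_queue.put(job_id)
    work_queue.put(None)  # sentinel so the demo thread exits
    worker.join(timeout=5)
    return results, work_queue.qsize()


if __name__ == "__main__":
    print(demo())
```

In this sketch every `put()` wakes the consumer, so a thread that stays parked in `get()` after a submit suggests the item never reached the queue object that thread is reading from. That is an inference from the pattern, not a confirmed diagnosis of this issue.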

pulsar-check --private_token=asdf --debug


INFO:pulsar.client.manager:Setting Pulsar client class to standard, non-caching variant.

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/t/script.py] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/dataset_0.dat] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/dataset_0_files/input_subdir/extra] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/metadata/12312231231231.dat] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/w/config.txt] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/m/metadata_test123] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/idx/seq/human_full_seqs] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/idx/bwa/human.fa.fai] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/idx/bwa/human.fa] (action_type: [transfer])

DEBUG:pulsar.client.client:Uploading path [/tmp/pulsar-check-client.dn8jifhl/w/config.txt] (action_type: [message])


Swapping to queued_condor, with no other changes, enabled jobs to execute.

Running the latest pulsar:

(venv) root@worker1:/home/hrasche2# pip freeze | grep pulsar
pulsar-app==0.15.6

In this case I'd rather not install htcondor if it isn't necessary.

hexylena commented 1 day ago

@natefoo @jmchilton if y'all have ideas on this, I'd appreciate it. Is there anything I can test? It seems like a very default configuration. Do I need to be running more uwsgi processes? I would like to avoid the overhead of a real DRM for this use case. I looked into the GitHub tests a bit, but those seem to set up proper Slurm rather than using queued_python. Is it possible that this deployment option isn't working?