askap-vast / vast-pipeline

This repository holds the code of the Radio Transient detection pipeline for the VAST project.
https://vast-survey.org/vast-pipeline/
MIT License

Web interface cannot recover from errors #575

Closed ddobie closed 3 years ago

ddobie commented 3 years ago

If an error is encountered when submitting pipeline jobs via the web interface, the job appears to be permanently broken. Any attempt to reprocess the run leaves it sitting with Queued status indefinitely.

e.g. https://dev.pipeline.vast-survey.org/piperuns/12/

ddobie commented 3 years ago

Tested in real-time using this run: https://dev.pipeline.vast-survey.org/piperuns/17/

djangoQ tables show no jobs hanging in the queue list.

Processing seems to get stuck at the very first stage, i.e. here: 2021-09-08 11:10:11,493 runpipeline DEBUG Copying temp config file.

Need to check the qcluster output.

ajstewart commented 3 years ago

I could not replicate this locally. Given that the job had never run successfully, re-running it should automatically turn on the full re-run mode, which prints a logger message after the above line. That is the only logging message that should follow, so my best guess is that something is hanging in this bit of code. But yes, it might show in the actual qcluster output. https://github.com/askap-vast/vast-pipeline/blob/7bab8fd7d854560c4e9da7d36d4d07d6134ac164/vast_pipeline/management/commands/runpipeline.py#L97-L192

A working log file would look like:

2021-09-08 10:54:05,530 runpipeline DEBUG Copying temp config file.
2021-09-08 10:54:05,593 runpipeline INFO Cleaning up pipeline run before re-process data
2021-09-08 10:54:05,612 runpipeline INFO Cleaning up forced measurements before re-process data

ajstewart commented 3 years ago

Managed to reproduce locally after all. The real error is:

11:53:47 [Q] INFO Process-1:2 processing [test-data-ui]
11:53:48 [Q] ERROR Failed [test-data-ui] - [Errno 2] No such file or directory: '/Users/adam/GitHub/vast-pipeline/pipeline-runs/test-data-ui/config_prev.yaml' : Traceback (most recent call last):
  File "/Users/adam/anaconda3/envs/vast-pipeline-dev/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/Users/adam/GitHub/vast-pipeline/vast_pipeline/management/commands/runpipeline.py", line 220, in run_pipe
    config_diff = pipeline.config.check_prev_config_diff()
  File "/Users/adam/GitHub/vast-pipeline/vast_pipeline/pipeline/config.py", line 522, in check_prev_config_diff
    label="previous run config",
  File "/Users/adam/GitHub/vast-pipeline/vast_pipeline/pipeline/config.py", line 254, in from_file
    with open(yaml_path) as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/adam/GitHub/vast-pipeline/pipeline-runs/test-data-ui/config_prev.yaml'

This comes about because the bit of code linked below is missing the same check for a UI run. https://github.com/askap-vast/vast-pipeline/blob/7bab8fd7d854560c4e9da7d36d4d07d6134ac164/vast_pipeline/management/commands/runpipeline.py#L151-L158
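
For illustration only, here is a minimal sketch of the kind of guard that would avoid the FileNotFoundError above. It is not the pipeline's actual code: the helper name prev_config_differs and the plain-text comparison are assumptions, and the real pipeline loads config_prev.yaml through its own config class. The idea is just to check that the previous config file exists before trying to diff against it.

```python
from pathlib import Path
from typing import Optional


def prev_config_differs(run_dir: Path, current_text: str) -> Optional[bool]:
    """Compare the current config text with config_prev.yaml in run_dir.

    Hypothetical helper, not part of vast-pipeline. Returns None when no
    previous config exists (e.g. the run never completed), otherwise
    True if the texts differ and False if they are identical.
    """
    prev_path = run_dir / "config_prev.yaml"
    if not prev_path.is_file():
        # Nothing to diff against, so signal that instead of raising.
        return None
    return prev_path.read_text() != current_text
```

A caller could treat a None result as "no previous successful run" and skip the diff, rather than letting the task fail inside the worker.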