askap-vast / vast-pipeline

This repository holds the code of the Radio Transient detection pipeline for the VAST project.
https://vast-survey.org/vast-pipeline/
MIT License

New input files are not detected when run config uses globs #645

Open marxide opened 2 years ago

marxide commented 2 years ago

When a user wishes to add images to an existing pipeline run, they modify the config to include the new inputs and relaunch the run. A check is performed to ensure that:

  1. New inputs have been added to the config, and
  2. No other settings have been changed.

Both of these conditions must be true for a pipeline run to be re-run in "add mode". The pipeline checks whether the inputs have changed by reading the previous config file, config_prev.yml, and comparing it with the updated config.yml file. Both config files are parsed and validated, and all glob expressions in them are resolved.

Suppose that the config inputs are a simple glob expression, e.g.

inputs:
  image:
    glob: /data/vast-survey/VAST/release/EPOCH*/COMBINED/STOKESI_IMAGES/*.fits

If new files that match this expression are added to the filesystem, the pipeline will fail to detect that the inputs have changed. It will read both config_prev.yml and config.yml, which would contain the same glob expression in this case, and compare them. Since the globs are resolved when the config file is read, both config files will end up with the same list of inputs even though new files matching the glob were added since the run was executed.
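To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the pipeline's actual code: `resolve_inputs` and `demo_glob_blind_spot` are hypothetical names, and the temporary directory only stands in for the real data tree.

```python
import glob
import tempfile
from pathlib import Path


def resolve_inputs(pattern):
    """Expand a glob expression into a sorted list of paths (hypothetical helper)."""
    return sorted(glob.glob(pattern))


def demo_glob_blind_spot():
    """Return (number of new files on disk, whether the two configs look identical)."""
    with tempfile.TemporaryDirectory() as tmp:
        image_dir = Path(tmp) / "EPOCH00"
        image_dir.mkdir()
        (image_dir / "a.fits").touch()
        pattern = str(Path(tmp) / "EPOCH*" / "*.fits")

        # First run: the glob is resolved and these images are ingested.
        images_used = resolve_inputs(pattern)

        # A new file matching the glob appears after the run has finished.
        (image_dir / "b.fits").touch()

        # Re-run: config_prev.yml and config.yml contain the same pattern,
        # and both are resolved now, so they yield the same list.
        prev_inputs = resolve_inputs(pattern)
        new_inputs = resolve_inputs(pattern)

        return len(new_inputs) - len(images_used), prev_inputs == new_inputs
```

In this sketch, one new file exists on disk, yet the two resolved input lists compare equal, so a diff of the configs alone reports no change.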

The problem is that the config diff check only compares the two config files, both of which have their globs resolved at check time, and doesn't look at which images were actually used in the previous run.

A potential solution would be to extend the config diff check to also compare the number of resolved inputs in config.yml with the number of images stored in the Run object (i.e. Run.n_images). If the number of resolved inputs is greater than the number of images in the Run object, then the run should be re-run in add mode. This won't work if images were removed, but that isn't allowed for "add mode" anyway.
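A rough sketch of what the extra comparison could look like is below. Only `Run.n_images` comes from the pipeline; the function names, the arguments, and the way the existing diff check is modelled (as a superset test on the resolved input lists) are assumptions made for illustration.

```python
def inputs_have_grown(resolved_inputs, n_images_in_run):
    """True if config.yml resolves to more inputs than the run has stored.

    n_images_in_run is assumed to be the value of Run.n_images.
    """
    return len(resolved_inputs) > n_images_in_run


def should_use_add_mode(prev_inputs, new_inputs, n_images_in_run,
                        other_settings_changed):
    """Hypothetical combined check for re-running in add mode."""
    # Add mode is not allowed if any other setting has changed.
    if other_settings_changed:
        return False

    # Stand-in for the existing check: new inputs were added to the config
    # and none were removed (strict superset of the previous inputs).
    added_in_config = set(new_inputs) > set(prev_inputs)

    # Proposed extra check: the glob now matches more files than were ingested.
    return added_in_config or inputs_have_grown(new_inputs, n_images_in_run)
```

For the glob case, `prev_inputs` and `new_inputs` are identical, so only the count comparison can trigger add mode. As noted, a count comparison cannot detect removed images.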

marxide commented 2 years ago

By the way, the context of this issue is that I found 15 low-band images that weren't included in the combined run. The inputs are specified with a glob expression per epoch, e.g.

inputs:
  image:
    epoch00:
      glob: /data/vast-survey/VAST/release/EPOCH00/COMBINED/STOKESI_IMAGES/*.fits
    epoch01:
      glob: /data/vast-survey/VAST/release/EPOCH01/COMBINED/STOKESI_IMAGES/*.fits
    ...

I don't think there's a way I can add the new images to this config without fixing the config diff check. If I add the new files to the config explicitly, they'll show up twice when the globs are resolved.
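The duplication can be reproduced with a small sketch. It assumes the inputs for an epoch are resolved by concatenating explicitly listed paths with the glob matches and without de-duplication, which may not be exactly how the pipeline does it. `resolve_epoch` and `demo_duplicates` are hypothetical names.

```python
import glob
import tempfile
from collections import Counter
from pathlib import Path


def resolve_epoch(explicit_paths, pattern):
    """Hypothetical resolver: explicit paths plus glob matches, no de-duplication."""
    return list(explicit_paths) + sorted(glob.glob(pattern))


def demo_duplicates():
    """Count how often each file appears after resolving one epoch's inputs."""
    with tempfile.TemporaryDirectory() as tmp:
        image_dir = Path(tmp)
        old_image = image_dir / "old.fits"
        new_image = image_dir / "new.fits"
        old_image.touch()
        new_image.touch()

        # The new image is listed explicitly, but it also matches the glob.
        resolved = resolve_epoch([str(new_image)], str(image_dir / "*.fits"))
        return dict(Counter(Path(p).name for p in resolved))
```

Under that assumption, the explicitly listed file is counted twice while the older file appears once.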