MultiBackendJobManager.run_jobs() doesn't add new jobs to existing job_tracker

VincentVerelst commented 7 months ago

The MultiBackendJobManager.run_jobs() method takes as input a df, which is a DataFrame containing information about all the jobs to run and an _outputfile, which contains the path to a csv file to track the status of all the jobs. If the _outputfile already exists, however, the run_jobs() method will ignore the df input and continue from the existing jobs in the _outputfile, as seen in the code below:

output_file = Path(output_file)
 if output_file.exists() and output_file.is_file():
      # Resume from existing CSV
      _log.info(f"Resuming `run_jobs` from {output_file.absolute()}")
      df = pd.read_csv(output_file)
      status_histogram = df.groupby("status").size().to_dict()
      _log.info(f"Status histogram: {status_histogram}")

This makes it so that once a MultiBackendJobManager is run a second time, with the same _outputfile, it's not possible to add new jobs. Is is possible that when _outputfile already exists, run_jobs() creates the union of the input df and existing _outputfile? Or is there a good reason not to?

soxofaan commented 7 months ago

I haven't played a lot with MultiBackendJobManager myself and don't know the practical use details to be honest.

jdries commented 7 months ago

@VincentVerelst this is certainly a possibility. I suggest that data engineering is free to extend this job manager as needed. Main reason not to do it would be to avoid unexpected behaviour, you really don't want your job csv to get corrupted and loose all info. What I sometimes did in the past is using a separate script to make necessary updates to the csv, while job manager script is stopped, and then, after verification of csv, restart job manager with updated job list.

Open-EO / openeo-python-client

MultiBackendJobManager.run_jobs() doesn't add new jobs to existing job_tracker #558