Crashing (presumably out-of-memory) tends to happen at:
```python
# Save the optimized parameters
params = optimized.to_dict()
pd.DataFrame.from_dict(params, orient="index").to_json(
    f"{target_root}/parameters.json", storage_options=storage_options
)
```
Some tests seem to fail at:
```python
# Verify the data
diff["diff"].count(["lat", "lon"]).plot()
```
with server connection errors. Perhaps there's a way to make these robust to transient failures by retrying?
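One possible approach (a minimal sketch, not something from this thread): wrap the flaky call in a generic retry helper with exponential backoff. The `retry` helper and its parameters below are hypothetical.

```python
import time


def retry(func, *args, attempts=3, base_delay=1.0, exceptions=(OSError,), **kwargs):
    """Call func, retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == attempts - 1:
                raise
            # Back off before the next attempt: 1s, 2s, 4s, ...
            time.sleep(base_delay * 2**attempt)


# e.g. wrap the flaky verification step:
# retry(lambda: diff["diff"].count(["lat", "lon"]).plot())
```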
Can we merge this PR?
Yes, go ahead. Let me know if you have any questions or problems.
This launches jobs with https://github.com/minrk/kbatch-papermill.
kbatch is both a command-line tool and a Python API.
Command-line tool:
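For example (an illustrative invocation; the job name, image, and command below are placeholders, and the exact flags should be checked against `kbatch --help`):

```sh
# Submit a job to the hub's Kubernetes cluster
kbatch job submit \
    --name=my-job \
    --image=my-registry/my-image:latest \
    --command="papermill input.ipynb output.ipynb"

# List submitted jobs
kbatch job list
```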
The main Python API is kbatch_papermill.kbatch_papermill, which runs a single notebook as a job, stores the result in S3, and returns the job id.
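A call might look roughly like this (a sketch only; the argument names `notebook`, `s3_dest`, and `parameters` are assumptions for illustration, not the confirmed signature — see the kbatch-papermill README for the real one):

```python
from kbatch_papermill import kbatch_papermill

# Hypothetical invocation: argument names are illustrative assumptions
job_id = kbatch_papermill(
    notebook="process_tag.ipynb",                      # notebook to execute (assumed)
    s3_dest="s3://bucket/results/process_tag.ipynb",   # where the result lands (assumed)
    parameters={"tag": "TAG_A"},                       # papermill parameters (assumed)
)
print(f"submitted job {job_id}")
```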
It currently requests 160GB for each job, which is not quite enough for every job (AD_A11382 fails). Many tags use far less than that, so we'd get more parallelism if we had a good heuristic for memory usage. We do at least have memory profile reports in the notebooks that finish running.
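One possible heuristic (a sketch, not from this thread): record the peak memory reported by the profile in each finished notebook, then request that peak plus some headroom for the next run of the same tag, falling back to the 160GB default for tags with no history. The tag names, observed values, and helper below are hypothetical.

```python
# Hypothetical heuristic: size each job's memory request from past peak usage
DEFAULT_MEM_GB = 160  # current blanket request
HEADROOM = 1.3        # safety margin over the observed peak

# In practice these would be scraped from the memory profile reports
# in finished notebooks; values here are purely illustrative.
observed_peaks_gb = {"TAG_A": 24, "TAG_B": 140}


def memory_request_gb(tag: str) -> int:
    """Request observed peak * headroom, or the default for unknown tags."""
    peak = observed_peaks_gb.get(tag)
    if peak is None:
        return DEFAULT_MEM_GB
    return int(peak * HEADROOM)
```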