Crashing (presumably out-of-memory) tends to happen at:
```python
# Save the optimized parameters
params = optimized.to_dict()
pd.DataFrame.from_dict(params, orient="index").to_json(
    f"{target_root}/parameters.json", storage_options=storage_options
)
```
Some tests seem to fail at:
```python
# Verify the data
diff["diff"].count(["lat", "lon"]).plot()
```
with server connection errors. Perhaps there's a way to make these robust to transient failures by retrying?
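One possible approach (a minimal sketch, not something from this thread): wrap the flaky call in a generic retry helper with exponential backoff. The `retry` helper and its parameters below are hypothetical.

```python
import time


def retry(func, *args, attempts=3, base_delay=1.0, exceptions=(OSError,), **kwargs):
    """Call func, retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == attempts - 1:
                raise
            # Back off before the next attempt: 1s, 2s, 4s, ...
            time.sleep(base_delay * 2**attempt)


# e.g. wrap the flaky verification step:
# retry(lambda: diff["diff"].count(["lat", "lon"]).plot())
```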
Can we merge this PR?
Yes, go ahead. Let me know if you have any questions or problems.
This launches jobs with https://github.com/minrk/kbatch-papermill.
kbatch is both a command-line tool and a Python API.
Command-line tool:
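For example (an illustrative invocation; the job name, image, and command below are placeholders, and the exact flags should be checked against `kbatch --help`):

```sh
# Submit a job to the hub's Kubernetes cluster
kbatch job submit \
    --name=my-job \
    --image=my-registry/my-image:latest \
    --command="papermill input.ipynb output.ipynb"

# List submitted jobs
kbatch job list
```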
The main Python API is kbatch_papermill.kbatch_papermill, which runs a single notebook as a job, stores the result in S3, and returns the job id.
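A call might look roughly like this (a sketch only; the argument names `notebook`, `s3_dest`, and `parameters` are assumptions for illustration, not the confirmed signature — see the kbatch-papermill README for the real one):

```python
from kbatch_papermill import kbatch_papermill

# Hypothetical invocation: argument names are illustrative assumptions
job_id = kbatch_papermill(
    notebook="process_tag.ipynb",                      # notebook to execute (assumed)
    s3_dest="s3://bucket/results/process_tag.ipynb",   # where the result lands (assumed)
    parameters={"tag": "TAG_A"},                       # papermill parameters (assumed)
)
print(f"submitted job {job_id}")
```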
It currently requests 160GB for each job, which is not quite enough for every job (AD_A11382 fails). Many tags use far less than that, so we'd get more parallelism if we had a good heuristic for memory usage. We do at least have memory profile reports in the notebooks that finish running.
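One possible heuristic (a sketch, not from this thread): record the peak memory reported by the profile in each finished notebook, then request that peak plus some headroom for the next run of the same tag, falling back to the 160GB default for tags with no history. The tag names, observed values, and helper below are hypothetical.

```python
# Hypothetical heuristic: size each job's memory request from past peak usage
DEFAULT_MEM_GB = 160  # current blanket request
HEADROOM = 1.3        # safety margin over the observed peak

# In practice these would be scraped from the memory profile reports
# in finished notebooks; values here are purely illustrative.
observed_peaks_gb = {"TAG_A": 24, "TAG_B": 140}


def memory_request_gb(tag: str) -> int:
    """Request observed peak * headroom, or the default for unknown tags."""
    peak = observed_peaks_gb.get(tag)
    if peak is None:
        return DEFAULT_MEM_GB
    return int(peak * HEADROOM)
```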