nbren12 closed this issue 4 years ago
I totally support looking for better alternatives 👍. I think next week it might be a good idea to sit down as a group and discuss what we've learned from this initial progress on the workflow and how we can improve on it going forward (both in the cloud, and in terms of the format/structure of the data that I provide from GFDL).
The initial deadline imposed by Gaea's downtime made it difficult to optimize the format/structure of the data I transferred. For example, going forward, would it help if the data for each category of restart files lived in a single zarr store on the cloud that contained all the timesteps?
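To make that concrete, here is a minimal sketch of what consuming such a store might look like; the bucket path, store name, and timestamp below are hypothetical, not our actual layout:

```python
# Minimal sketch: one consolidated zarr store per restart-file category,
# with all timesteps along a single "time" dimension.
# The bucket path, store name, and timestamp are hypothetical.
import fsspec
import xarray as xr

mapper = fsspec.get_mapper("gs://my-bucket/restarts/fv_core.zarr")
ds = xr.open_zarr(mapper)

# Lazily select a single timestep; only the needed chunks are read.
snapshot = ds.sel(time="2016-08-01T00:00:00")
```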
I agree as well. I also +1 Spencer's suggestion to schedule time to talk together about how to plan this out, and maybe come up with a preliminary design doc.
Sounds good. Let's get these PRs merged and sweep up the wreckage of this week, and then we can debrief.
> going forward would it help if the data for each category of restart files lived in a single zarr store on the cloud that contained all the timesteps?
I think the current format is actually more convenient for our purposes, since it is easy to use to restart the model, and we have already written a bunch of code against it at this point. The one change that would help is uploading the data untarred, which would be a bit more convenient.
Dear pipeline team (@spencerkclark @AnnaKwa @frodre @brianhenn),
I just wanted to point out that we got a ton done this week (see #41, #35, #38, #29, #30, #33, and more)! Thanks so much for all your efforts. I think we can all get started doing ML soon.
This week has been a very eye-opening experience for me, playing around with multiple cloud technologies and coordinating with nearly everyone on the team to scale this pipeline in the cloud. In particular, it has highlighted some problems with `snakemake`. Basically, `snakemake` does not seem to work well for our workflow:

- The remote file support (`GS.remote`) is honestly pretty awful. Even if files are reused between several rules, it seems to re-download the data by default. More seriously, it floods the GCS API with tons of GET requests, slowing down DAG construction and potentially leading to access-denied errors. (A sketch of this pattern follows the list.)
- There is a ton of overhead in specifying the filenames needed for each rule. Our pipeline is actually quite simple; it is essentially a linear chain in which extraction feeds coarse-graining.
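For reference, the remote-file pattern in question looks roughly like this. This is a sketch only: the bucket name, paths, and shell command are hypothetical, but `snakemake.remote.GS` is the provider the comment refers to:

```python
# Snakefile sketch of the GS.remote pattern (snakemake's GCS remote provider).
# Bucket name, paths, and the coarsen_restarts command are hypothetical.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider

GS = GSRemoteProvider()

rule coarsen:
    # Every remote input and output is checked against the GCS API while
    # the DAG is built, and inputs are downloaded locally before the rule
    # runs, even if another rule already fetched the same object.
    input:
        GS.remote("my-bucket/extracted/{timestep}/restart.tar")
    output:
        GS.remote("my-bucket/coarsened/{timestep}/restart.tar")
    shell:
        "coarsen_restarts {input} {output}"
```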
However, each of these operations (extraction, coarse-graining) produces many files, and `snakemake`'s workflow is a dependency graph of files. Therefore, instead of saying "coarse-graining depends on extraction", we have to specify "these million files produced by coarse-graining depend on these other million files produced by extraction". Indeed, most of the code in the `Snakefile` is devoted to building all of these file names and using them to wire the rules together, roughly as sketched below.
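The wiring ends up looking something like this; the timestep list, category names, and path templates are hypothetical stand-ins for what the real `Snakefile` enumerates:

```python
# Snakefile sketch of the filename bookkeeping described above.
# TIMESTEPS, CATEGORIES, and the path templates are hypothetical.
TIMESTEPS = [line.strip() for line in open("timesteps.txt")]
CATEGORIES = ["fv_core", "fv_tracer", "sfc_data"]

rule all:
    input:
        # "coarse-graining depends on extraction" must be spelled out as
        # one concrete path per (timestep, category) pair, so the rule
        # graph is wired file-by-file rather than step-by-step.
        expand("coarsened/{timestep}/{category}.nc",
               timestep=TIMESTEPS, category=CATEGORIES)
```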
I personally think we should try to refactor our pipeline out of `snakemake` entirely in favor of a more cloud-friendly alternative. What do you think?