nbren12 closed this issue 4 years ago
I totally support looking for better alternatives 👍. I think next week it might be a good idea to sit down as a group and discuss what we've learned from this initial progress on the workflow and how we can improve on it going forward (both in the cloud, and in terms of the format/structure of the data that I provide from GFDL).
The initial deadline imposed by Gaea's downtime made it difficult to optimize the format/structure of the data I transferred. For example, going forward, would it help if the data for each category of restart files lived in a single zarr store on the cloud that contained all the timesteps?
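To make that concrete, here is a minimal sketch of what consuming such a store might look like; the bucket path, store name, and timestamp below are hypothetical, not our actual layout:

```python
# Minimal sketch: one consolidated zarr store per restart-file category,
# with all timesteps along a single "time" dimension.
# The bucket path, store name, and timestamp are hypothetical.
import fsspec
import xarray as xr

mapper = fsspec.get_mapper("gs://my-bucket/restarts/fv_core.zarr")
ds = xr.open_zarr(mapper)

# Lazily select a single timestep; only the needed chunks are read.
snapshot = ds.sel(time="2016-08-01T00:00:00")
```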
I agree as well. I also +1 Spencer's suggestion to schedule time to talk together about how to plan this out, and maybe come up with a preliminary design doc.
Sounds good. Let's get these PRs merged and sweep up the wreckage of this week, and then we can debrief.
> going forward would it help if the data for each category of restart files lived in a single zarr store on the cloud that contained all the timesteps?
I think the current format is actually more convenient for our purposes, since it is easy to use to restart the model, and we have already written a bunch of code against it at this point. The one change that would help is uploading the data untarred, which would be a bit more convenient.
Dear pipeline team (@spencerkclark @AnnaKwa @frodre @brianhenn),
I just wanted to point out that we got a ton done this week (see #41, #35, #38, #29, #30, #33, and more)! Thanks so much for all your efforts. I think we can all get started doing ML soon.
This week has been a very eye-opening experience for me, playing around with multiple cloud technologies and coordinating with nearly everyone on the team to scale this pipeline in the cloud. In particular, it has highlighted some problems with `snakemake`. Basically, `snakemake` does not seem to work well for our workflow:

- The remote file support (`GS.remote`) is honestly pretty awful. Even if files are reused between several rules, it seems to re-download the data by default. More seriously, it floods the GCS API with tons of GET requests, slowing down DAG construction and potentially leading to access-denied errors. (A sketch of this pattern follows the list.)
- There is a ton of overhead in specifying the filenames needed for each rule. Our pipeline is actually quite simple; it is essentially a linear chain in which extraction feeds coarse-graining.
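For reference, the remote-file pattern in question looks roughly like this. This is a sketch only: the bucket name, paths, and shell command are hypothetical, but `snakemake.remote.GS` is the provider the comment refers to:

```python
# Snakefile sketch of the GS.remote pattern (snakemake's GCS remote provider).
# Bucket name, paths, and the coarsen_restarts command are hypothetical.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider

GS = GSRemoteProvider()

rule coarsen:
    # Every remote input and output is checked against the GCS API while
    # the DAG is built, and inputs are downloaded locally before the rule
    # runs, even if another rule already fetched the same object.
    input:
        GS.remote("my-bucket/extracted/{timestep}/restart.tar")
    output:
        GS.remote("my-bucket/coarsened/{timestep}/restart.tar")
    shell:
        "coarsen_restarts {input} {output}"
```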
However, each of these operations (extraction, coarse-graining) produces many files, and `snakemake`'s workflow is a dependency graph of files. Therefore, instead of saying "coarse-graining depends on extraction", we have to specify "these million files produced by coarse-graining depend on these other million files produced by extraction". Indeed, most of the code in the `Snakefile` is devoted to building all of these file names and using them to wire the rules together, roughly as sketched below.
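The wiring ends up looking something like this; the timestep list, category names, and path templates are hypothetical stand-ins for what the real `Snakefile` enumerates:

```python
# Snakefile sketch of the filename bookkeeping described above.
# TIMESTEPS, CATEGORIES, and the path templates are hypothetical.
TIMESTEPS = [line.strip() for line in open("timesteps.txt")]
CATEGORIES = ["fv_core", "fv_tracer", "sfc_data"]

rule all:
    input:
        # "coarse-graining depends on extraction" must be spelled out as
        # one concrete path per (timestep, category) pair, so the rule
        # graph is wired file-by-file rather than step-by-step.
        expand("coarsened/{timestep}/{category}.nc",
               timestep=TIMESTEPS, category=CATEGORIES)
```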
I personally think we should try to refactor our pipeline out of `snakemake` entirely in favor of a more cloud-friendly alternative. What do you think?