ai2cm / fv3net

explore the FV3 data for parameterization
MIT License

Digging ourselves out of the snakepit #42

Closed by nbren12 4 years ago

nbren12 commented 4 years ago

Dear pipeline team (@spencerkclark @AnnaKwa @frodre @brianhenn),

I just wanted to point out that we got a ton done this week (see #41, #35, #38, #29, #30, #33, and more)! Thanks so much for all your efforts. I think we can all get started on ML soon.

This week has been very eye-opening for me: I have been playing around with multiple cloud technologies and coordinating with nearly everyone on the team to scale this pipeline in the cloud. In particular, it has highlighted some problems with snakemake.

Basically, snakemake does not seem to work well for our workflow:

I personally think we should move our pipeline off snakemake entirely, in favor of a more cloud-friendly alternative. What do you think?

spencerkclark commented 4 years ago

I totally support looking for better alternatives 👍. I think next week it might be a good idea to sit down as a group and discuss what we've learned from this initial progress on the workflow and how we can improve on it going forward (both in the cloud, and in terms of the format/structure of the data that I provide from GFDL).

The initial deadline imposed by Gaea's downtime made it difficult to optimize the format/structure of the data I transferred. For example, going forward would it help if the data for each category of restart files lived in a single zarr store on the cloud that contained all the timesteps?

AnnaKwa commented 4 years ago

I agree as well. I also +1 Spencer's suggestion to schedule time to talk through how to plan this out, and maybe come up with a preliminary design doc.

nbren12 commented 4 years ago

Sounds good. Let's get these PRs merged and sweep up the wreckage of this week, and then we can debrief.

> going forward would it help if the data for each category of restart files lived in a single zarr store on the cloud that contained all the timesteps?

I think the current format is actually more convenient for our purposes, since it is easy to use for restarting the model and we have already written a bunch of code around it. The only change I would ask for is uploading untarred data, which would be a bit more convenient.