COSIMA / cosima-recipes

A cookbook of recipes (i.e., examples) for analysing ocean and sea ice model output
https://cosima-recipes.readthedocs.io
Apache License 2.0

New chunking tutorial #442

Open adele-morrison opened 3 months ago

adele-morrison commented 3 months ago

In @Thomas-Moore-Creative's COSIMA talk, there was a lot of interest in having a tutorial in the Recipes showing best practice for chunking for different types of problems.

e.g. which dimensions to chunk in depending on the problem, how big should chunks be, etc.

This might be a good starting point to base this on: https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/chunking.html
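As a rough illustration of the kind of trade-off such a tutorial could cover, here is a minimal sketch that compares two chunk layouts for the same array. The grid size, chunk shapes, and helper names are all hypothetical and not taken from any COSIMA dataset; it only does the arithmetic and does not touch real files.

```python
from math import ceil, prod


def chunk_mib(chunk_shape, itemsize=4):
    """Size of one chunk in MiB, given its shape and the bytes per value."""
    return prod(chunk_shape.values()) * itemsize / 2**20


def chunks_touched(selection, chunk_shape):
    """Number of chunks spanned by a selection of the given extent.

    Assumes the selection starts on a chunk boundary.
    """
    return prod(ceil(selection[dim] / chunk_shape[dim]) for dim in chunk_shape)


# Two hypothetical layouts for a float32 array with 120 time steps
# on a 2700 x 3600 horizontal grid.
map_friendly = {"time": 1, "y": 2700, "x": 3600}       # one full map per chunk
series_friendly = {"time": 120, "y": 300, "x": 300}    # all times, small tiles

# Two access patterns: a time series at one grid point, and one full map.
point_series = {"time": 120, "y": 1, "x": 1}
single_map = {"time": 1, "y": 2700, "x": 3600}

for name, chunks in [("map_friendly", map_friendly),
                     ("series_friendly", series_friendly)]:
    print(
        f"{name}: {chunk_mib(chunks):.1f} MiB per chunk, "
        f"point series touches {chunks_touched(point_series, chunks)} chunks, "
        f"single map touches {chunks_touched(single_map, chunks)} chunks"
    )
```

With these made-up numbers both layouts have chunks of roughly 40 MiB, but the point time series touches 120 chunks in the first layout and 1 in the second, while a single map touches 1 and 108 respectively.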

Thomas-Moore-Creative commented 3 months ago

@adele-morrison - this is a really good opportunity to talk about community and others' contributions. It's a rather circular path of influences. In 2017 it was, as I understand it, COSIMA that brought @jmunroe to Australia, and he then helped @dougiesquire and me get Pangeo-style workflows deployed on CSIRO HPC. We became very interested in chunking and chunking strategies because we had a very large (and large-ensemble) dataset to deal with. @dougiesquire took a proactive lead on testing strategies for rechunking and the zarr format, and I learned from that effort. It's very appropriate that you link @dougiesquire's basic notes on chunking.

What would be great is if someone at COSIMA had a real problem that all those interested could work through. The solutions are general but the details really matter (IMO) and going through the process together with a real problem might be a good way to start?

Thomas-Moore-Creative commented 3 months ago

@ongqingyee, @jemmajeffree, et al

Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.

jemmajeffree commented 3 months ago

I agree that focussing on a specific problem is important. I don't think I have any current problems that would be useful for this. I'm currently working with 2D fields in large ensembles, and have thoroughly optimised importing these with xarray, but they were already chunked in a useful dimension anyway. Given that the COSIMA output is currently chunked {time:1}, I think the best example is probably something through time.

With 2D fields and <250 yrs of monthly data, my approach is usually just to haul the whole thing into memory and then rechunk for analysis. So for rechunking separately to make a difference, we'd probably need to either use daily data or deliberately run on only a few cores.
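A back-of-the-envelope sketch of why monthly 2D fields can often be loaded whole while daily data cannot. The grid size below is a hypothetical choice for illustration (not any particular COSIMA configuration), and float32 values are assumed.

```python
def footprint_gib(n_time, ny, nx, itemsize=4):
    """In-memory size in GiB of a (time, y, x) array with itemsize bytes per value."""
    return n_time * ny * nx * itemsize / 2**30


# Hypothetical horizontal grid; swap in the real dimensions of your output.
ny, nx = 1080, 1440
years = 250

monthly_gib = footprint_gib(years * 12, ny, nx)
daily_gib = footprint_gib(years * 365, ny, nx)

print(f"monthly: {monthly_gib:.1f} GiB")
print(f"daily:   {daily_gib:.1f} GiB ({daily_gib / monthly_gib:.1f}x larger)")
```

Under these assumptions the monthly field is under 20 GiB, which a large-memory node can hold, while the daily field is roughly 30 times larger and would need a deliberate chunking strategy.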

I'd be interested in helping develop this tutorial, but I'm going to be a bit slow and unreliable while I'm still building up the courage to engage with the COSIMA github.

ongqingyee commented 3 months ago

> @ongqingyee, @jemmajeffree, et al
>
> Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.

I have a simple example that could work well. It involves masking out a region around the Antarctic margin and calculating a circumpolar averaged surface speed. In my experience, masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start with, and changes/additions can be made from there. @Thomas-Moore-Creative @jemmajeffree
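To make the shape of that calculation concrete, here is a minimal sketch of a masked, area-weighted mean surface speed using NumPy on synthetic data. It is not the notebook itself: the latitude cut-off used as the mask, the regular lat-lon grid, the cos(latitude) area weights, and the function name are all assumptions for illustration, and it does not use xgcm or chunked arrays.

```python
import numpy as np


def masked_area_mean(field, area, mask):
    """Area-weighted mean of field over the grid cells where mask is True."""
    weights = np.where(mask, area, 0.0)
    return (field * weights).sum() / weights.sum()


# Synthetic surface velocities on a hypothetical regular 1-degree grid.
rng = np.random.default_rng(0)
lat = np.linspace(-80.0, -50.0, 31)
lon = np.linspace(0.0, 359.0, 360)
u = rng.normal(0.0, 0.1, size=(lat.size, lon.size))
v = rng.normal(0.0, 0.1, size=(lat.size, lon.size))

# Surface speed from the two velocity components.
speed = np.hypot(u, v)

# Hypothetical mask: keep everything south of 65S, a stand-in for a
# real mask that follows the Antarctic margin.
mask = np.broadcast_to((lat <= -65.0)[:, None], speed.shape)

# On a regular lat-lon grid, cell area is proportional to cos(latitude).
area = np.broadcast_to(np.cos(np.deg2rad(lat))[:, None], speed.shape)

mean_speed = masked_area_mean(speed, area, mask)
print(f"circumpolar mean surface speed: {mean_speed:.3f} m/s")
```

In a real workflow the same mask-then-weighted-mean pattern would be applied to chunked model output, which is where the choice of chunking along time versus space starts to matter.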

Working on the same notebook through GitHub is new to me though, so help on the logistics of that would be great.

Thomas-Moore-Creative commented 3 months ago

> > @ongqingyee, @jemmajeffree, et al Do you have any current problems that could be tackled jointly to build up a first example? I'm attempting to source a small chunk of /scratch we could use collectively to build up some rechunking / ARD workflows together.
>
> I have a simple example that can be good. It involves masking out a region around the Antarctic margin and calculating a circumpolar averaged surface speed. In my experience I found that masking and integrating on an xgcm grid makes applying chunking trickier. I'm happy to put together a draft notebook to start and changes/additions can be made? @Thomas-Moore-Creative @jemmajeffree
>
> Working on the same notebook through github is new to me though, so help on the logistics of that would be great.

I'm a bit of a COSIMA outsider, so others (@navidcy? @anton-seaice? others) might have something to say about where best to put your new example notebook in the repo and what the practice is for branching. FWIW, I'd suggest you start a new branch for this issue, and others can then contribute via their own branches off your branch. Again, COSIMA regulars might have other views.

Your problem does seem to have a lot of detail so it would be good to see the code, what the source data is, and the goal for the final output. Thanks.

navidcy commented 3 months ago

What do you mean "branching" and "branches of your branch"? You are referring to repository branches?

The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

Thomas-Moore-Creative commented 3 months ago

> What do you mean "branching" and "branches of your branch"? You are referring to repository branches?
>
> The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

Hopefully I haven't already added confusion for @ongqingyee who was looking for more clarity around githubby things. =)

edoddridge commented 3 months ago

@Thomas-Moore-Creative is describing some more advanced GitHub techniques than we generally use in this repo.

When someone opens a PR, they have suggested changes that live on a branch in their repo. If I want to make changes to their PR (and I don't have write access to their PR branch), I can open a pull request based on the branch in the pull request. The original PR owner can then merge my PR into their PR, and then we can merge their PR into the main repo.

As an example, you can look at this PR in MITgcm: https://github.com/MITgcm/MITgcm/pull/47 Gael Forget and Erik van Sebille both made pull requests onto my PR branch. Those changes were then incorporated into the PR.

navidcy commented 3 months ago

Oh I see what you mean @Thomas-Moore-Creative

Thomas-Moore-Creative commented 3 months ago

> Oh I see what you mean @Thomas-Moore-Creative

I think the most important point is that, whatever the GitHub practice is, it should be simple enough and/or supported enough that newbies can engage and make it to the next level of their GitHub life.

ongqingyee commented 3 months ago

> What do you mean "branching" and "branches of your branch"? You are referring to repository branches?
>
> The best place for an example is in the recipes directory. Open a PR and submit it. Does this clarify the question above?

This I can do. Thanks all!