Thomas-Moore-Creative / Climatology-generator-demo

A demonstration / MVP to show how one could build an "interactive" climatology & compositing tool on Gadi HPC.
MIT License

chat? #16

Closed by dcherian 7 months ago

dcherian commented 7 months ago

I saw the link to a flox issue and from a quick browse it looks like you're struggling a bit trying to get it to work well.

Wanna chat?

I wrote flox with the goal of making climatology generation and compositing "just work", so this is a bit sad :(

Thomas-Moore-Creative commented 7 months ago

@dcherian !!!! . . . this has to be the BEST open source "customer service" in coding history. It's a great example of what being a bit courageous and making all your (possibly terrible) code and (possibly silly) problems open to the world for all to see can do. What is possibly "a bit sad" is my understanding and approach - so I'd really appreciate a chat.

Right now it's Friday night (Tasmania time) and my little family and I are going camping so I likely won't get back to you on this until Monday (Tasmania time).

dcherian commented 7 months ago

hehehe I'm obsessed with solving this problem! No rush of course but happy to chat next week.

Some general comments (applicable to the latest flox/xarray):

  1. If you are not grouping by multiple variables, you can use the much nicer ds.groupby("time.month").mean() syntax and it will use flox automatically if installed.
  2. Over the past few months, I've worked on setting method automatically by looking at how the groups are distributed across chunks. You should not have to set method; flox will choose what's appropriate for whatever chunking you have.
  3. That said, (2) runs on heuristics, so it's possible it doesn't make the right choice. Details will help, particularly what you are grouping by and how the big array is chunked along the core dimensions of the groupby operation.
  4. Are you running into trouble with the groupby reduction itself, or with calculating anomalies with respect to the grouped-reduced values? With dask, you want to do those two steps separately; otherwise memory issues will often result.
  5. It's tempting to simply say .chunk(time=30), for example, but really you should think about rechunking to a frequency (e.g. monthly or two-monthly). We don't have nice syntax for this yet (https://github.com/pydata/xarray/issues/7559), but you can quite easily figure out the appropriate chunk tuple with ds.time.resample(time="2M").count(). Done properly, flox will then choose either "cohorts" or "blockwise" automatically for you and save some memory (a short sketch putting points 1, 4 and 5 together follows this list). Here's an example: https://flox.readthedocs.io/en/latest/user-stories/climatology.html#rechunking-data-for-cohorts
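Putting (1), (4) and (5) together, here is a minimal, hypothetical sketch using the small xarray air_temperature tutorial dataset as a stand-in for the real data; the chunk size, the variable, and the "2M" frequency are illustrative only, not recommendations for the Gadi datasets.

```python
import xarray as xr

# Tutorial data as a stand-in (needs internet + pooch); real data would come
# from open_mfdataset on the NetCDF files.
ds = xr.tutorial.open_dataset("air_temperature").chunk({"time": 100})

# (1) Plain xarray syntax; flox is picked up automatically when installed.
clim = ds.groupby("time.month").mean()

# (4) Materialise the climatology first, then compute anomalies as a
# separate step, rather than building one giant graph.
clim = clim.compute()
anom = ds.groupby("time.month") - clim

# (5) Rechunk along time to a calendar frequency: count the timesteps in
# each two-month period and use those counts as the chunk sizes.
counts = tuple(int(n) for n in ds.time.resample(time="2M").count().values)
ds_freq = ds.chunk({"time": counts})
clim_freq = ds_freq.groupby("time.month").mean()  # flox can now pick "cohorts"/"blockwise"
```
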
Thomas-Moore-Creative commented 7 months ago

hehehe I'm obsessed with solving this problem!

And for this, @dcherian, many are very grateful!

People likely think I'm obsessed too, but my progress has been slow despite your excellent documentation. Part, perhaps much, of that is another issue (#13): xr.open_mfdataset reported a variable loaded from NetCDF short data as float32, which was expected, but when computed it became float64. I didn't notice this at first and it caused unexpected memory issues. I'm not suggesting there is any bug in xr.open_mfdataset, more that I don't understand how it handles the specific NetCDF short variables I'm loading. For now I'm forcing float32 with .astype.
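For reference, a minimal sketch of the .astype workaround described above; the file pattern, variable name, and chunking are placeholders rather than the repo's actual data, and the float64 promotion is most likely xarray's CF decoding of packed short data (scale_factor / add_offset) choosing a wider float for safety.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "/path/to/files/*.nc",   # placeholder path
    parallel=True,
    chunks={"time": 12},     # illustrative chunking only
)

# The packed short variable reports float32 here, but CF decoding
# (scale_factor / add_offset) can promote values to float64 at compute
# time, roughly doubling memory. Forcing float32 up front keeps the
# memory footprint predictable.
ds["sst"] = ds["sst"].astype("float32")   # "sst" is a placeholder name
```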

My open repo here is mainly my personal work notes and so I'm sure it's not easy to understand or follow. I'll apply what you've said above and reframe the problem for you below.

Thomas-Moore-Creative commented 7 months ago

@dcherian - I've summarised things in #17 and will document progress in that issue. I'd appreciate any comments you have over there. Thanks.