coecms / xmhw

Xarray version of Marine Heatwaves code by Eric Olivier
https://xmhw.readthedocs.io/en/latest/
Apache License 2.0

Can code handle 3D data, i.e. lat,lon and depth without looping over gridpoints? #24

Closed · sryan288 closed this issue 3 years ago

sryan288 commented 3 years ago

Hi, great effort and thanks for making this code public! I think an xarray/dask version is definitely needed. I actually started working on this too over the last few days, following Eric's code and this one: https://github.com/ritwik1111/MHW. My student and I have started working with high-resolution model output at daily resolution and will definitely need code that can operate with dask and, ideally, on 3D fields without having to loop through grid points. Is your code capable of doing that? Let me know, and maybe we could set up a zoom meeting if that is easier to discuss. It would be a waste of time if everyone is trying to do their own thing. Cheers, Svenja

paolap commented 3 years ago

Hi Svenja,

the code should handle 3D data: the first thing it does is to "stack" all the grid points, i.e. every dimension that is not time, into a single dimension, so it doesn't matter how many dimensions there are or what the coordinates are called. We should definitely have a zoom meeting, so I can explain better how this was developed and its pros and cons. I would love for someone to adopt it, and I'm actively working on it now. I'm not a MHW researcher myself, but we have quite a few in our centre and I'm now getting feedback from their group. I've just added my email to my profile, so feel free to contact me if you want a meeting; even today would be fine for me, as I'm trying to work out what the next steps should be.
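(For illustration, here is a minimal xarray sketch of the stacking idea described above; the array, dimension names and sizes are made up for the example and are not taken from xmhw internals.)

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative only: a small synthetic SST-like array with time, depth, lat and lon.
ts = xr.DataArray(
    np.random.rand(365, 5, 4, 3),
    dims=("time", "depth", "lat", "lon"),
    coords={"time": pd.date_range("2000-01-01", periods=365)},
    name="sst",
)

# Collapse every dimension that is not time into a single "cell" dimension,
# so the same per-cell code works whether the grid is (lat, lon) or (lat, lon, depth).
space_dims = [d for d in ts.dims if d != "time"]
stacked = ts.stack(cell=space_dims)

print(stacked.dims)           # ('time', 'cell')
print(stacked.sizes["cell"])  # 5 * 4 * 3 = 60 grid points
```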

Thomas-Moore-Creative commented 3 years ago

Hey @paolap I think I have a similar question to @sryan288? Apologies in advance if this turns out to be my ignorance. I've forked your public repo and I'm working on the alternative branch.

I'm using some of the code out of one of your example notebooks: https://github.com/coecms/xmhw/blob/alternative/docs/xmhw_demo.ipynb

Again, this may be my ignorance, but when I get to `climds = threshold(ts_data_da.isel(lat=slice(0, 10), lon=slice(0, 10)))` the behaviour I see on my cluster looks like we are "looping" over the 100 individual lat/lon points. If I work with a large cluster I don't seem to get any scaling benefit, and the time taken to compute just depends on the number of lat/lon points?

In my case I'm working with some high-res SST data already "unchunked" in time.

Thanks again for sharing this work.

paolap commented 3 years ago

So yes, it can handle 3D data or any number of dimensions, as it stacks all the dimensions that are not time. As for the loop, it's correct that once the data is stacked it will loop cell by cell, but I'm using functions decorated with dask.delayed. In this branch runavg(), calculate_thresh(), calculate_seas() and define_events() are delayed. Initially I used a different, fully xarray approach, but that was more cumbersome for some operations, so I went back to using pandas (for the groupby in the event-detection part) and numpy for some operations.

I never used a really big number of CPUs; I found 4, with the chunking I had, were optimal. I also used the hugemem queue on gadi (at least for the threshold). I just ran it for one of our students on the COSIMA ocean model output. I ended up breaking his grid into 4, mostly to test different approaches. It was definitely running in parallel, as I was getting 50%+ CPU usage, which is not great but isn't bad. He had 30 years of data on a 150x160 grid for each grid sector, and the run time ranged from 5 to 18 hours depending on the number of land cells, chunking, etc. Hendrik used more cores for his calculation, but I'm not sure how efficiently they were used. Having more delayed functions makes the calculation more parallel, but at the same time increases the memory used. I'm sure this could be optimised better.

One of the reasons I started this is that we had users complaining about how long it took to run; it used to take them days. I think Eric's code can be sped up in the same way, using delayed or other dask approaches. As I said, today the main advantage of this is having some of that integrated, the modularity, and the way the detect function manages the final data. Keep in mind also that I haven't documented the last changes yet, as I only finished yesterday, and that notebook is quite old. I haven't tested it yet, but for a big grid I wanted to try using one of the map functions; I'm really still learning how to work with big datasets.
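(A rough sketch of the delayed-loop pattern described above, not xmhw's actual implementation: per_cell_threshold is a made-up stand-in for the real delayed functions such as calculate_thresh(), and stacked is the (time, cell) array from the earlier sketch.)

```python
import dask
import numpy as np

# Hypothetical per-cell calculation standing in for xmhw's delayed functions.
@dask.delayed
def per_cell_threshold(series):
    # e.g. a percentile along time for one grid cell
    return np.nanpercentile(series, 90)

# Build one lazy task per stacked cell; nothing runs yet.
tasks = [per_cell_threshold(stacked.isel(cell=i).values)
         for i in range(stacked.sizes["cell"])]

# dask.compute executes all the tasks in parallel, using whatever workers
# the active scheduler/Client provides.
results = dask.compute(*tasks)
```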

Thomas-Moore-Creative commented 3 years ago

Thanks for that detailed reply @paolap! First, how wonderful to have you working on this and supporting others' data science tasks!

I'm sure some of my questions here are related to my own ignorance; I need to dig deeper into how I'm running the methods you are providing. Is my expectation that I should be able to scale over cells/pixels for a high-resolution, large dataset misplaced?

I might wait to see the new documentation and, if it's helpful, start an issue for scaling / testing with some standard dataset?

paolap commented 3 years ago

If you can do some scaling testing that would be great, or if you can share the data you're using I was going to do some myself. The reality is that I need help to work this out: I had Scott have a look from time to time to check that the approach was sensible, but no one has really run a performance test. Your expectation should be correct; using delayed should parallelise the loop, and the limiting factor could be the way the dataset is chunked. Apart from the obvious requirement that the entire time series for one cell has to be in one chunk, the code doesn't handle the chunking of the other dimensions. As I said, when I ran it the CPU usage was decent. I haven't yet worked out how to set up a reasonable Dask Client, even using Scott's gadi_client from climtas, but then I always try this in a hurry.
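(On the chunking point: the usual way to guarantee the whole time series for each cell sits in a single chunk is to pass chunks={"time": -1} when opening the data. The file name and spatial chunk sizes below are arbitrary examples, not recommendations from xmhw.)

```python
import xarray as xr

# Hypothetical file name; chunks={"time": -1} keeps the full time dimension
# in one chunk per cell, while lat/lon are split into modest spatial chunks.
ds = xr.open_dataset("sst_daily.nc", chunks={"time": -1, "lat": 50, "lon": 50})
```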

paolap commented 3 years ago

I just found the details of Hendrik's simulation.

He ran it on a model output including all depth levels as well: 169 x 196 x 20 grid points. This is the count of the ocean points only, i.e. the ones that effectively get some calculation; the actual grid is bigger.

The whole dataset was 40 GB, which he loaded before running the threshold function in a Jupyter notebook. He used 32 cores, the maximum memory usage was 58 GB, and it took 24 hr 30 min to run. Unfortunately I don't have details of the actual CPU usage, or how many workers there were per core; in my experience using that many cores might not be necessary.
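(For reference, a local cluster with a comparable number of cores could be set up along these lines with dask.distributed; the worker/thread/memory split below is only illustrative, since the configuration actually used in that run isn't known.)

```python
from dask.distributed import Client, LocalCluster

# 8 workers x 4 threads = 32 cores; memory_limit is per worker.
cluster = LocalCluster(n_workers=8, threads_per_worker=4, memory_limit="8GB")
client = Client(cluster)
print(client)
```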