vanroekel closed this issue 7 years ago.
@vanroekel, this means we need to set a smaller chunk size, e.g., make https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/config.default#L77 a smaller number.
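For reference, the option in question lives in config.default; a sketch of what the relevant entry looks like (the section name and value below are illustrative, not copied from the repository; check the linked file for the actual section and default):

```ini
[input]
# maximum chunk size (in time) used when reading input files;
# reduce this if the analysis runs out of memory
maxChunkSize = 10000
```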
Some background on this: I previously raised related issues with the xarray and dask developers (https://github.com/pydata/xarray/issues/1338 and https://github.com/dask/dask/issues/2138); if I understand correctly, the plan is for a more meaningful warning to be raised in the future.
@pwolfram is there any guidance on where this value should be set?
@vanroekel, please see http://xarray.pydata.org/en/stable/dask.html#chunking-and-performance, specifically:
A good rule of thumb is to create arrays with a minimum chunksize of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunksizes.
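The rule of thumb above can be turned into a quick back-of-the-envelope calculation. A minimal sketch (the helper name `min_time_chunk` is mine, not from MPAS-Analysis or xarray) that picks how many time slices per chunk are needed to reach roughly one million elements for an (nTime, ny, nx) array:

```python
import math

def min_time_chunk(ny, nx, min_elements=1_000_000):
    """Smallest number of time slices per chunk so that each chunk
    of an (nTime, ny, nx) array holds at least min_elements values."""
    elements_per_slice = ny * nx
    return max(1, math.ceil(min_elements / elements_per_slice))

# a 1000x1000 slice already holds one million elements,
# so a single time slice per chunk is enough
print(min_time_chunk(1000, 1000))  # -> 1

# a 100x100 slice holds only 10,000 elements,
# so about 100 slices are needed per chunk
print(min_time_chunk(100, 100))    # -> 100
```

The tension in this thread is that chunks must be large enough to amortize dask's scheduling overhead but small enough to fit in memory, which is why a single default value is hard to pick.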
@pwolfram I was able to compute 10 years of MOC by experimenting with the chunk size, but it was frustrating to have to adjust this value several times before it worked. Is there a way to improve the documentation on how to set it (your explanation above suggests to me that maxChunkSize should be increased, not decreased)? Or, better yet, could it be set automatically? I can imagine someone on the coupled team getting as frustrated as I was trying to figure out this parameter. Given the error that is printed (dask.async.memory), I don't see how a user would know to go to config.default and change that parameter, much less how to change it correctly.
I just checked the output from the 10-year run I discussed above and I don't see any plots. Looking back at the log, the run did not finish; I see "Killed" after 90 minutes of computation. Do you think I should experiment with chunkSize more?
A separate question: has the MOC calculation been run on edison for Chris's 60to30 coupled case for multiple years? I'm wondering if this is an anvil-specific issue.
Let's discuss this tomorrow and make it a priority, since the MOC is in the v0.2 version that @xylar and I are working on pulling into ACMEPreAndProcessing.
Thanks for all your work, @vanroekel!
Yes, I did run the MOC on the beta1 runs, even for 30 years. This was before we had the maxchunk fix (but I was basically just lucky never to run into a memory problem on the login node...).
Confirming that this appears to be an anvil issue: I can run this on edison (10 years of beta1_2) with no problems, but still no luck on anvil. I have done many tests across a wide range of chunk sizes, with no success.
After talking with @vanroekel, it sounds like an anvil-specific issue. We are trying this with all conda-forge packages and will try again.
@pwolfram I've updated all packages to the latest and greatest and all use conda-forge, but still no dice.
I'll have to leave this for now. I don't have time to work on this anymore.
@vanroekel: things work for you now, with the caching in place, right? If so, I think we can close this.
This issue has been addressed by #177
I've been trying to process simulation output from a G-case (60to30) on anvil. If I do one year, the MOC calculates fine, but if I do ten years, I see the following:
I have checked my conda environment to verify it is up to date. Is there some config option I need to set? I saw a file percent option and a file chunk size option. Do I need to set those differently? I'm using the default values for both now.