Open QianqianHan96 opened 1 year ago
When I use my python script without Dask to predict the whole year 17000 steps, it used 100 GB memory.
Please add a link to your python script in this issue.
The script is in 2read10kminput-halfhourly-0628.ipynb.
Please add a link to the notebook 2read10kminput-halfhourly-0628.ipynb in this issue.
Thanks for your advice, Sarah, I added it.
I have been trying to narrow this down by checking three stages: 1) loading the data, 2) preprocessing, including temporal and spatial resampling, and 3) the RF model prediction. I found that data loading and preprocessing are fine; it is the RF model that causes the memory problem.
When I load the RF model outside of the map_blocks function and then pass the model object to map_blocks, the unmanaged memory is extremely high. As I said in the first post in this issue: when I tried to predict 1500 timesteps (the data for 1500 timesteps is 297 MB), I always hit the worker memory limit (240 GB), and I checked again that the memory is mostly unmanaged memory.
After I changed the code to pass the model path to map_blocks instead of the model object, the unmanaged memory looks normal, similar in size to the managed memory. But the RF model is only 245 MB; I cannot understand why passing the object itself causes so much unmanaged memory.
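For reference, a minimal runnable sketch of the path-passing approach (the array shapes, file names, and toy model below are illustrative, not the actual notebook code). Passing the model object itself embeds it in the serialized task graph, so copies of it can accumulate on the scheduler and workers as unmanaged memory; passing only the path string avoids that:

```python
import os
import tempfile

import joblib
import numpy as np
import dask.array as da
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the pre-trained model on disk (in the real workflow this
# would be the existing 245 MB RF model file; names here are hypothetical).
rng = np.random.default_rng(0)
X_train = rng.random((100, 3))
y_train = X_train.sum(axis=1)
model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_train, y_train)
model_path = os.path.join(tempfile.mkdtemp(), "rf.joblib")
joblib.dump(model, model_path)

def predict_block(block, path):
    # Only the small path string travels through the task graph; the model
    # is loaded from disk inside the task, on the worker that runs it.
    m = joblib.load(path)
    return m.predict(block)  # (n_samples, n_features) -> (n_samples,)

data = da.random.random((1000, 3), chunks=(250, 3))  # toy (samples, features) array
pred = data.map_blocks(predict_block, model_path, dtype="float64", drop_axis=1)
result = pred.compute()
```

Loading the model once per block does add repeated disk reads, but it keeps each task's memory footprint bounded to one model copy at a time.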
This experiment is still for the 5 degree area. Predicting 2000 and 5000 timesteps is now no problem. However, predicting 10000 timesteps failed until I increased either the memory or the number of CPUs, and predicting 17000 timesteps failed until I increased both. I have two questions: (1) While the code was running, the Dask dashboard showed that memory usage was not very high, although the unmanaged memory was sometimes high. Why do we need 480 GB of memory for 3 GB of input data (17000 timesteps)? (2) For 2000 timesteps, increasing the workers from 4 to 8 made the run faster. However, for 5000 timesteps the run failed when I tried to use 8 workers. Why is that? If this is the case, how can the running time be improved?
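As a point of reference for question (2), here is a minimal sketch of the worker-count/memory trade-off, assuming a single-machine dask.distributed setup (the worker counts and memory limits below are illustrative, not the values from the runs above). With a fixed total memory budget, doubling the number of workers halves the memory available to each one, so a workload that fits with 4 workers can hit the per-worker limit with 8:

```python
from dask.distributed import Client, LocalCluster

# Illustrative values only: memory_limit is per worker, not the total,
# so more workers on the same node means a tighter limit per worker.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    memory_limit="1GB",   # per-worker limit
    processes=False,      # in-process workers keep this example lightweight
)
client = Client(cluster)
n_workers = len(client.scheduler_info()["workers"])
client.close()
cluster.close()
```

Checking the per-worker limit reported on the dashboard's Workers tab against the size of one task's working set (input chunk plus model copy plus output) is one way to estimate how many workers a node can sustain.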
@geek-yang and @fnattino, see the issue here.
@geek-yang and @fnattino, Hi Yang, Francesco, I prepared two Jupyter notebooks. The code is on GitHub now: 1) 2000 timesteps: https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/2read10kminput-halfhourly-0904_2000steps.ipynb. 2) 1 year: https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/2read10kminput-halfhourly-0904_1year.ipynb. I put the potential reasons at the beginning of the 1-year notebook. Please let me know if anything is unclear. Thanks for your help!
@geek-yang and @fnattino Hi Yang, Francesco,
I prepared two Jupyter notebooks. The code is on GitHub now: 1) 1 year for the 10 degree area: https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_10degree.ipynb. 2) 1 year for Europe: https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_Europe.ipynb. The good news is that I managed to make the script run for the 10 degree area. The bad news is that the Europe run is not working yet.
For the 10 degree area, although I managed to make it run, it would be better if you could help me check whether the script is correct. Specifically, I have the following 4 questions:
For the Europe area, I did not manage to make it run. The error is in cell 55 of https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_Europe.ipynb. It seems to be a data size problem, but even when I tried to predict 151×151 pixels (the 10 degree area is 101×101 pixels), it gave me the same error.
All the input data is on Snellius, so you can run my script there directly. I am using a fat node with 32 CPUs and 240 GB of memory.
Based on the progress we made on June 27th, we managed to predict 100 timesteps with Dask. I also managed to reduce the size of the trained RF model from 15 GB to 245 MB. I can predict 200 timesteps (with 240 GB of memory). However, when I try to predict 1500 timesteps (the data for 1500 timesteps is 297 MB), I always hit the worker memory limit (240 GB), no matter how many workers I use (4/32/64); threads_per_worker is always 1. When I requested 960 GB, I still hit the worker memory limit. When I use my Python script (https://github.com/EcoExtreML/Emulator/blob/main/1computationBlockTest/2read10kminput-halfhourly-0616.py) without Dask to predict the whole year (17000 steps), it uses 100 GB of memory. I do not understand why Dask needs so much memory. Could you give some advice on this problem? The script is at https://github.com/EcoExtreML/Emulator/blob/main/1computationBlockTest/2read10kminput-halfhourly-0628.ipynb.
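The comment above mentions shrinking the trained RF model from 15 GB to 245 MB but does not say how. As a hedged sketch of common levers (not necessarily the method used here): limiting tree complexity at training time bounds the number of stored nodes, and joblib's `compress` option shrinks the file on disk. All data and parameters below are toy values:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data; in practice this would be the emulator's training set.
rng = np.random.default_rng(0)
X, y = rng.random((500, 4)), rng.random(500)

# Capping depth and leaf size limits how many nodes each tree can store,
# which is usually the dominant term in a serialized RF's size.
model = RandomForestRegressor(
    n_estimators=20, max_depth=8, min_samples_leaf=4, random_state=0
).fit(X, y)

tmp = tempfile.mkdtemp()
raw_path = os.path.join(tmp, "rf_raw.joblib")
packed_path = os.path.join(tmp, "rf_packed.joblib")
joblib.dump(model, raw_path)                 # uncompressed pickle
joblib.dump(model, packed_path, compress=3)  # zlib-compressed on disk
raw_size = os.path.getsize(raw_path)
packed_size = os.path.getsize(packed_path)
```

Note that compression only reduces the on-disk size; the model still expands to its full in-memory size when loaded, so it does not by itself reduce worker memory usage.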