FAIRiCUBE / uc1-urban-climate

FAIRiCUBE Urban adaptation to climate change Use Case
MIT License

Climate data processing with dask extremely slow #3

Closed mari-s4e closed 4 months ago

mari-s4e commented 1 year ago

This script https://github.com/FAIRiCUBE/uc1-urban-climate/blob/master/notebooks/f02_cube/subcubes_utci_stats.py processes 1 year of hourly climate data (around 14 GB) to produce daily statistics for selected EU cities. I use dask to (hopefully) speed up data loading and processing. However, the script runs considerably slower on FAIRiCube Hub than locally.
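For illustration, a minimal sketch of this kind of pipeline with xarray and dask; the file paths, variable name and chunk sizes below are placeholders, not the actual script's values:

```python
# Sketch: open one year of hourly NetCDF files lazily with dask and reduce
# them to daily statistics. Paths and variable names are hypothetical.
import xarray as xr

ds = xr.open_mfdataset(
    "data/utci_2022_*.nc",      # hypothetical hourly NetCDF files
    combine="by_coords",
    chunks={"time": 24},        # one dask chunk per day of hourly steps
    parallel=True,
)

# Daily statistics of the (hypothetical) "utci" variable
daily = xr.Dataset({
    "utci_mean": ds["utci"].resample(time="1D").mean(),
    "utci_min": ds["utci"].resample(time="1D").min(),
    "utci_max": ds["utci"].resample(time="1D").max(),
})

daily.to_netcdf("utci_daily_stats_2022.nc")  # triggers the dask computation
```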

The basic steps are (running times on the local machine vs. on FAIRiCube Hub in parentheses):

My question is: why is the processing so slow? Is there a problem in the code?

eox-cs1 commented 1 year ago

After cross-checking with our experts: it doesn't seem to be the script. It seems more likely that it is coming from the storage. The permanent storage available is NFS-mounted, and compared to your local disk, probably an SSD, there is a big speed difference (up to 10-fold). But it could also be a limitation on the RAM side, so that the process is permanently swapping, which slows it down. What is your initial RAM setting?

mari-s4e commented 1 year ago

When starting the Hub I have the following settings:

CPU: 0.2
Memory: 213.55 MB
Host CPU: 0.6% used on 8 CPUs
Host virtual memory: Active 820.46 MB, Available 29.19 GB, Free 27.51 GB, Inactive 1.74 GB, Used 1.03 GB (4.8%), Wired 0.00 B, Total 30.67 GB

Does it help?

mari-s4e commented 1 year ago

Hello, I have run some more benchmarks and tested the new UC1 large profile (with doubled resources). I think the problem is not related to the memory size or the number of CPUs: on FAIRiCube Hub the script takes the same amount of time regardless of the resources. Here again some information about the machines (recorded with the Measurer developed by @cozzolinoac11); the "large" profile on FAIRiCube Hub has double the resources of my local machine.

| Metric | FAIRiCube Hub large (365 days) | Local machine (365 days) |
| --- | --- | --- |
| Data size (MB) | 90.7421875 | |
| Main memory available (GB) | 61.68314362 | 23.94339371 |
| Main memory consumed (GB) | 0.333139 | 0.339320951 |
| CPU/GPU machine type | x86_64 | AMD64 |
| CPU/GPU processor type | x86_64 | Intel64 Family 6 Model 60 Stepping 3 GenuineIntel |
| CPU/GPU number of physical cores | 8 | 4 |
| CPU/GPU number of logical cores | 16 | 8 |
| Network traffic (MB) | 119131.431390762 | 14.760983467102 |
| Wall time (seconds) | 8504.97295928001 | 682.038042545318 |

Can you @eox-cs1 see why there is such a difference in performance by looking at this data? The only thing I see is the difference in network traffic. Could it be that the process is slowed down by having to access the data on the S3 bucket?

eox-cs1 commented 1 year ago

I learned from my colleagues that many small requests to S3 will slow processes down very much, but many small requests alone wouldn't increase the data volume that much.

I looked at your code and didn't see any loops, so what astonishes me here is that you have 119 TB (!!) of network traffic compared to 14 GB.
Any idea how you could possibly generate such high network traffic?

mari-s4e commented 1 year ago

Hi Christian, the root cause of the problem seems to be the combination of NetCDF and cloud storage, cf. this forum. Reading NetCDF from S3 is slow because NetCDF is not a cloud-optimized format, and this is also what causes the high network traffic (lots of requests just to get the metadata). Zarr seems to be the cloud-optimized alternative to NetCDF. This means we have to rethink how we handle climate data, since it usually comes in NetCDF format. Btw, it is 119 GB, not TB; I had forgotten to add the unit (fixed).
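A minimal conversion sketch, assuming xarray with the zarr and s3fs packages installed; the bucket name, file names and chunking below are placeholders:

```python
# Sketch: convert a local NetCDF file to a Zarr store on S3.
# Bucket name, file names and chunk sizes are hypothetical.
import s3fs
import xarray as xr

# Open lazily and rechunk to sizes that match the later access pattern
ds = xr.open_dataset("utci_2022.nc").chunk({"time": 24})

fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="s3://my-bucket/utci_2022.zarr", s3=fs, check=False)

# Consolidated metadata keeps the number of S3 requests needed to open the store low
ds.to_zarr(store, mode="w", consolidated=True)
```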

@KathiSchleidt this is probably information relevant for you as well.

eox-cs1 commented 1 year ago

Hi Maria, yes, you seem to be right. Although NetCDF was developed for network access, it doesn't seem to work well on object storage. Network traffic: 119 GB is much better, but still ~8.5 times the 14 GB. The issue of NetCDF with S3 seems to be already known, and a search provided some possible solutions you could try to speed things up (the listing is not in any particular order):

https://pypi.org/project/s3fs/
https://github.com/fsspec/s3fs/issues/168+
https://github.com/meracan/s3-netcdf
https://pypi.org/project/S3netCDF4/
https://stackoverflow.com/questions/43197223/using-aws-s3-and-apache-spark-with-hdf5-netcdf-4-data/60885374#60885374
https://nasa-openscapes.github.io/2021-Cloud-Workshop-AGU/how-tos/Multi-File_Direct_S3_Access_NetCDF_Example.html
https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314
https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe68
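A minimal sketch of one of these options, reading a NetCDF file directly from S3 via s3fs/fsspec with a large block size so that fewer (but larger) range requests are issued; bucket, file and variable names are placeholders:

```python
# Sketch: direct S3 access to a NetCDF file with a large read block size.
# Names are hypothetical; requires s3fs, h5netcdf and xarray.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=False)

with fs.open(
    "s3://my-bucket/utci_2022.nc",
    mode="rb",
    block_size=16 * 1024 * 1024,  # 16 MB per range request instead of many small ones
) as f:
    ds = xr.open_dataset(f, engine="h5netcdf", chunks={"time": 24})
    # Compute while the remote file is still open
    daily_mean = ds["utci"].resample(time="1D").mean().compute()
```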

KathiSchleidt commented 12 months ago

@mari-s4e @eox-cs1 I don't have the time to read the various pages, but based on the links it seems the general recommendation is to first convert the NetCDF to Zarr.

Regardless - please tell me what solution you come up with!!!

eox-cs1 commented 12 months ago

One might do that with one's own datasets, but for external data stores this will likely not be an option. The links above provide 2-3 alternative S3 access options which could be benchmarked.
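A minimal benchmarking sketch; the candidate access functions below are placeholders to be filled with the actual options under test:

```python
# Sketch of a harness for comparing S3 access options; the bodies of the
# candidate functions are placeholders, not real implementations.
import time

def benchmark(label, run):
    """Time a single open-and-reduce run and print the wall time."""
    start = time.perf_counter()
    run()
    print(f"{label}: {time.perf_counter() - start:.1f} s")

def via_s3fs_default():
    ...  # open the NetCDF with default s3fs settings and compute daily stats

def via_s3fs_large_blocks():
    ...  # same, but with a large block_size

def via_zarr_copy():
    ...  # read a Zarr copy of the same data

for label, fn in [
    ("s3fs default", via_s3fs_default),
    ("s3fs 16 MB blocks", via_s3fs_large_blocks),
    ("zarr copy", via_zarr_copy),
]:
    benchmark(label, fn)
```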

mari-s4e commented 11 months ago

Hi @eox-cs1, I have reviewed your suggestions:

mari-s4e commented 4 months ago

Issue resolved: traditional file formats (e.g. TIFF, NetCDF) cause a lot of network traffic and slow down the computation when the file resides in cloud object storage. Cloud-optimized formats like COG and Zarr are designed to overcome this problem.
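For reference, a minimal sketch of the cloud-optimized route, reading a Zarr store directly from S3 with a recent xarray/zarr/s3fs stack; bucket, store and variable names are placeholders:

```python
# Sketch: read a Zarr store from S3 and compute daily statistics.
# Bucket, store and variable names are hypothetical.
import xarray as xr

ds = xr.open_zarr(
    "s3://my-bucket/utci_2022.zarr",
    consolidated=True,
    storage_options={"anon": False},
)

# Only the chunks needed for the reduction are fetched from object storage
daily_max = ds["utci"].resample(time="1D").max().compute()
```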