Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

investigate high memory usage for agera5 at sentinel-2 resolution #413

Open jdries opened 1 year ago

jdries commented 1 year ago

Job id: j-f1b1efdb2c6e4fc680f1ddedde0b5f91. The user had to set executor memory very high.

Jobs were crashing when writing the actual NetCDFs. filter_spatial was used, so all the data for a single NetCDF ends up on one executor. The combined size of the NetCDFs was 25625636648 bytes, i.e. 25625636648/(1024*1024) ≈ 24439 MB (~24 GB).
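For intuition, a back-of-the-envelope sketch of the uncompressed footprint. This is a rough order-of-magnitude estimate, not driver-internal accounting: it assumes float32 cells, the 256-pixel tile size from DataCubeParameters, and that the 46 keys in the sparsity log line below are spatial keys needed for every day of the year.

```python
# Rough uncompressed footprint of the requested cube, assuming float32 cells.
# Tile size and key counts are read from the log lines below.
TILE = 256   # tile size from DataCubeParameters
BANDS = 8    # agera5 bands requested
DAYS = 366   # 2020 is a leap year, ByDay temporal layout
KEYS = 46    # spatial keys actually required (sparse cube)

tile_bytes = TILE * TILE * BANDS * 4   # one multiband tile
cube_bytes = tile_bytes * KEYS * DAYS  # all required keys, all days
print(f"per tile: {tile_bytes / 1024**2:.0f} MB")    # 2 MB
print(f"whole cube: {cube_bytes / 1024**3:.1f} GB")  # 32.9 GB
```

That is in the same ballpark as the observed NetCDF totals above, so holding samples uncompressed on an executor easily explains the memory pressure.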

relevant logging:

Creating layer for AGERA5 with load params {'temporal_extent': ('2020-01-01', '2020-12-31'), 'spatial_extent': {'west': 3.016675851392828, 'south': 44.25828932897241, 'east': 4.37861196995034, 'north': 45.15020643463023, 'crs': 'EPSG:4326'}, 'global_extent': {'west': 3.016675851392828, 'south': 44.25828932897241, 'east': 4.37861196995034, 'north': 45.15020643463023, 'crs': 'EPSG:4326'}, 'bands': ['dewpoint-temperature', 'precipitation-flux', 'solar-radiation-flux', 'temperature-max', 'temperature-mean', 'temperature-min', 'vapour-pressure', 'wind-speed'], 'properties': {}, 'aggregate_spatial_geometries': <shapely.geometry.multipolygon.MultiPolygon object at 0x7febd6690520>, 'sar_backscatter': None, 'process_types': {<ProcessType.FOCAL_SPACE: 6>}, 'custom_mask': {}, 'data_mask': None, 'target_crs': {'$schema': 'https://proj.org/schemas/v0.2/projjson.schema.json', 'type': 'GeodeticCRS', 'name': 'AUTO 42001 (Universal Transverse Mercator)', 'datum': {'type': 'GeodeticReferenceFrame', 'name': 'World Geodetic System 1984', 'ellipsoid': {'name': 'WGS 84', 'semi_major_axis': 6378137, 'inverse_flattening': 298.257223563}}, 'coordinate_system': {'subtype': 'ellipsoidal', 'axis': [{'name': 'Geodetic latitude', 'abbreviation': 'Lat', 'direction': 'north', 'unit': 'degree'}, {'name': 'Geodetic longitude', 'abbreviation': 'Lon', 'direction': 'east', 'unit': 'degree'}]}, 'area': 'World', 'bbox': {'south_latitude': -90, 'west_longitude': -180, 'north_latitude': 90, 'east_longitude': 180}, 'id': {'authority': 'OGC', 'version': '1.3', 'code': 'Auto42001'}}, 'target_resolution': [10, 10], 'resample_method': 'cubic', 'pixel_buffer': None}

Loading with params DataCubeParameters(256, {}, FloatingLayoutScheme, ByDay, 6, None, CubicConvolution, 0.0, 0.0) and bands dewpoint-temperature;precipitation-flux;solar-radiation-flux;temperature-max;temperature-mean;temperature-min;vapour-pressure;wind-speed initial layout: LayoutDefinition(Extent(501310.0, 4898170.0, 613950.0, 5000570.0),CellSize(10.0,10.0),22x20 tiles,11264x10240 pixels)
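As a quick sanity check, the pixel grid in the log line above follows directly from the extent and the 10 m target resolution:

```python
# Sanity check on the LayoutDefinition above: the logged extent at the
# 10 m target resolution reproduces the logged pixel grid exactly.
west, south, east, north = 501310.0, 4898170.0, 613950.0, 5000570.0
cols = (east - west) / 10.0    # -> 11264.0
rows = (north - south) / 10.0  # -> 10240.0
print(f"{cols:.0f}x{rows:.0f} pixels")  # 11264x10240, as logged
```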

Cube partitioner index: SparseSpaceTimePartitioner 1656 true

Datacube is sparse: true, requiring 46 keys out of 420.
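For illustration, a minimal PySpark sketch of the mechanism described above: grouping tiles by geometry id (which is effectively what filter_spatial-driven sample writing does) pulls every tile of one sample onto a single executor before the NetCDF is written. The keys and payloads here are toys, not the actual driver internals.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

# toy tiles keyed by (geometry_id, day); payloads stand in for raster tiles
tiles = sc.parallelize(
    [((g, d), b"tile-bytes") for g in range(4) for d in range(366)]
)

# re-keying by geometry_id and grouping moves every tile that belongs to one
# sample (one output NetCDF) onto a single executor -- the memory hotspot
per_sample = tiles.map(lambda kv: (kv[0][0], kv[1])).groupByKey()
print(sorted(per_sample.mapValues(lambda it: sum(1 for _ in it)).collect()))
# [(0, 366), (1, 366), (2, 366), (3, 366)]
```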

jdries commented 1 year ago

Not entirely sure yet whether this is the only problem, but it would be good to reduce the memory usage of writing NetCDF samples. Perhaps we can consider creating compressed tiles and decompressing them only right before they go into the NetCDF, allowing us to write in a more streaming manner. Another option would be a file format like Zarr.
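A minimal sketch of the compressed-tiles idea, assuming tiles arrive on the executor as zlib-compressed float32 blobs keyed by (time index, band index, tile row, tile col); write_sample_netcdf and the tuple layout are hypothetical, not existing driver API:

```python
import zlib
import numpy as np
import netCDF4

def write_sample_netcdf(path, tiles, shape, tile_size=256):
    """Write one sample tile by tile, so only a single decompressed tile
    is in memory at a time instead of the whole materialised array."""
    t_len, b_len, y_len, x_len = shape
    ds = netCDF4.Dataset(path, "w")
    ds.createDimension("t", t_len)
    ds.createDimension("band", b_len)
    ds.createDimension("y", y_len)
    ds.createDimension("x", x_len)
    var = ds.createVariable("data", "f4", ("t", "band", "y", "x"), zlib=True)
    for t, b, row, col, blob in tiles:
        # decompress just-in-time, write, and let the tile be garbage collected
        tile = np.frombuffer(zlib.decompress(blob), dtype="f4")
        tile = tile.reshape(tile_size, tile_size)
        var[t, b,
            row * tile_size:(row + 1) * tile_size,
            col * tile_size:(col + 1) * tile_size] = tile
    ds.close()
```

Zarr would give similar streaming behaviour essentially for free, since its chunked layout lets each tile be written independently without materialising the whole sample.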