Closed: EmileSonneveld closed this issue 3 months ago.
The tilesize workaround comes from: https://jira.vito.be/projects/EP/issues/EP-3738
The tilesize workaround seems to be hanging too:
- tilesize 16: j-2402207d2dde4cfab9209d3f76815bde
- tilesize 1: j-2402206838d141e2a1193635ce461873
Ok, we'll want to look at the partitioner then. Can you indicate priority/urgency, to properly schedule?
Finding a workaround in the coming days would be nice, as we want to pass it on to the integrators soon.
After the resampling, the size should be around 10,000x6,000 px; compressed, that is around 35 MB per timestamp.
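For scale, a back-of-envelope estimate of the uncompressed in-memory size per timestamp (assuming a single float32 band, as the logs later in this thread report with `cellType=float32`):

```python
# Rough uncompressed size per timestamp after resampling.
# Assumption: one float32 band of ~10,000 x 6,000 px.
width_px, height_px = 10_000, 6_000
bytes_per_px = 4  # float32

mb_per_timestamp = width_px * height_px * bytes_per_px / 1024 / 1024
print(f"~{mb_per_timestamp:.0f} MB uncompressed per timestamp")  # ~229 MB
```

So the ~35 MB compressed figure hides a roughly 7x larger in-memory footprint, multiplied again by the number of daily timestamps.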
@EmileSonneveld I will still be tied up today, so pointing out where something needs to change in case you want to give it a try: https://github.com/Open-EO/openeo-geotrellis-extensions/blob/e2496005325faf5afc01f510aa9480cb0046156c/openeo-geotrellis/src/main/scala/org/openeo/geotrellis/OpenEOProcesses.scala#L734
Now the question is which partitioner you'll want to configure to fix it. To answer that, it would be good to know which partitioner is actually configured when the cube enters resample_cube_spatial. You could consider adding logging for that, or use a test with a similar graph.
For the target partitioner, a 'regular' spacetime partitioner would probably work. (Anything that breaks up the larger partitions into smaller ones would do, as the final partitioner is determined by the target datacube anyway.)
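As a conceptual sketch (not the GeoTrellis API), a 'regular' spatial partitioner simply maps each key's (col, row) onto a fixed grid block, so one sparse, oversized partition gets split into bounded, evenly sized ones; the `block_size` here is an illustrative assumption:

```python
# Conceptual sketch of a 'regular' space partitioner: keys in the
# same fixed-size grid block share a partition, keys further apart
# land in different partitions. Not the GeoTrellis implementation.
def regular_partition(col: int, row: int, block_size: int = 32) -> tuple:
    return (col // block_size, row // block_size)

# Keys inside one 32x32 block share a partition...
assert regular_partition(0, 0) == regular_partition(31, 31)
# ...while keys in the next block do not.
assert regular_partition(32, 0) != regular_partition(0, 0)
```

A sparse index (like the SparseSpaceTimePartitioner in the logs below) can instead produce very uneven partitions, which is why swapping it for a regular one is worth trying here.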
Relevant logging:
Cube partitioner index: SparseSpaceTimePartitioner 3007776 false
Created cube for with metadata TileLayerMetadata(uint16ud65535,LayoutDefinition(Extent(9.949999999999989, -40.14999999999999, 40.14999999999999, -19.94999999999999),CellSize(0.1,0.10000000000000002),302x202 tiles,302x202 pixels),Extent(10.0, -40.0, 40.0, -20.0),EPSG:4326,KeyBounds(SpaceTimeKey(0,0,1420070400000),SpaceTimeKey(300,200,1688083200000))) and partitioner Some(SpacePartitioner(KeyBounds(SpaceTimeKey(0,0,1420070400000),SpaceTimeKey(300,200,1688083200000))))
apply_neighborhood created datacube Metadata(bounds=Bounds(minKey=SpaceTimeKey(col=0, row=0, instant=datetime.datetime(2015, 1, 1, 0, 0)), maxKey=SpaceTimeKey(col=300, row=200, instant=datetime.datetime(2023, 6, 30, 0, 0)))cellType=float32noDataValue=nancrs=+proj=longlat +datum=WGS84 +no_defs extent=Extent(xmin=10.0, ymin=-40.0, xmax=40.0, ymax=-20.0)tileLayout=TileLayout(layoutCols=302, layoutRows=202, tileCols=1, tileRows=1)layoutDefinition=LayoutDefinition(extent=Extent(xmin=9.949999999999989, ymin=-40.14999999999999, xmax=40.14999999999999, ymax=-19.94999999999999), tileLayout=TileLayout(layoutCols=302, layoutRows=202, tileCols=1, tileRows=1))
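The metadata above also hints at why the tilesize 1 job hangs: a 302x202 layout of 1x1 px tiles means one SpaceTimeKey per pixel per date. A rough count over the key bounds shown (2015-01-01 to 2023-06-30, daily):

```python
# Rough key count for the 302x202 layout of 1x1 px tiles reported
# in the metadata above, assuming one key per pixel per daily date.
from datetime import date

cols, rows = 302, 202
days = (date(2023, 6, 30) - date(2015, 1, 1)).days + 1  # inclusive

keys = cols * rows * days
print(cols * rows, "keys per date;", keys, "keys in total")
```

Well over a hundred million tiny records, so per-key overhead alone would likely dominate, independent of the partitioner.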
@EmileSonneveld the job with tilesize 1 is no longer available in the Spark history server, because it keeps a limited number of applications. You really want to open the Spark UI and take screenshots, or retrieve the number of partitions, to determine the effect. Also, as this is a memory issue, did you already try increasing it? That may help determine how big your partitions get after resampling.
@EmileSonneveld the bounding box you have in there is pretty large; isn't South Africa much smaller?
The initial fix has some effect: the partitioner is 'None' after applying the UDF, so we need to see how to deal with that. Perhaps the Jep runtime can be a better option, allowing us to retain a partitioner...
Reprojecting datacube with partitioner None to new layout LayoutDefinition(extent=Extent(xmin=14.949999999999989, ymin=-34.14999999999999, xmax=32.15238095237819, ymax=-20.947619047621153), tileLayout=TileLayout(layoutCols=1445, layoutRows=1109, tileCols=4, tileRows=4)) and 4326
Repartitioning datacube with 1419 partitions to 14190 before resample_spatial.
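The log above shows a fixed 10x repartition (1419 to 14190) before resample_spatial. A hypothetical alternative (not an existing openEO/GeoTrellis helper) would be to size the partition count from the expected data volume instead of a fixed factor:

```python
# Hypothetical helper: choose a partition count so each partition
# stays under a target size after resampling blows up the volume.
# The 229 MB/timestamp and ~3100 timestamps are the rough estimates
# discussed earlier in this thread, not measured values.
import math

def partitions_needed(total_bytes: int, target_mb: int = 128) -> int:
    return math.ceil(total_bytes / (target_mb * 1024 * 1024))

total = 229 * 1024 * 1024 * 3100  # ~229 MB x ~3100 daily timestamps
print(partitions_needed(total))   # 5547 partitions of <= 128 MB each
```

Sizing by bytes would adapt to the actual low-to-high resolution blow-up, whereas a fixed 10x can still leave individual partitions too large.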
The exact extent is indeed a bit smaller. I am re-launching the task with tilesize 16 and a smaller extent:
spatial_extent={"west": 16.448304, "south": -46.980603, "east": 37.998802, "north": -22.12718, }
@EmileSonneveld the memory explosion is quite apparent in spark ui:
This raises the question whether there are ways to reduce the cube size. A common quick win is datatype conversion. I guess we also can't get rid of the daily time resolution?
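To illustrate the datatype quick win: converting a float32 band to int16 with a scale factor halves the memory footprint. The array shape and scale factor below are illustrative, not taken from this workflow:

```python
# Illustration of the dtype quick win: float32 -> scaled int16
# halves the per-band memory footprint. Values are synthetic.
import numpy as np

band_f32 = np.random.rand(1000, 1000).astype(np.float32)
band_i16 = np.round(band_f32 / 0.001).astype(np.int16)  # scale factor 0.001

print(band_f32.nbytes // 1024, "KiB vs", band_i16.nbytes // 1024, "KiB")
```

Whether int16 precision is acceptable depends on the AGERA5 variable; the original collection is uint16 with a nodata value per the TileLayerMetadata above, so a compact dtype is at least plausible.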
For the SPI, the data is pre-aggregated to monthly values over a longer time period. But in this example there is indeed 6x the number of dates. Original code here: https://git-ext.gmv.com/anin-external/drought-indices/-/blob/openeo/SPI/SPI_openeo.py
Can you run it again yourself on openeo-dev? If you have a failing example, just inspect the Spark UI for some details and provide the process graph in this ticket. From the logs, try retrieving the lines about resampling and the partitioner that's used before resampling.
This morning I launched the task again: j-240226bf712544d39465424b234b1a34
Tested on openeo-dev with a STAC catalog and load_collection on AGERA5. For the load_collection, I specified a smaller temporal extent, because it has a larger temporal resolution. Doing a resample_spatial from low to high resolution can give OOM errors here.
Task that shows OOM: j-240220be46af48c78e55462095630792, logs at https://epod204.vgt.vito.be:8042/node/containerlogs/container_e5123_1707931224191_16000_01_000025/emile.sonneveld/stdout?start=-4096 (link might expire)
Task without resample that does not OOM: j-240220ab5f1644108c9ac35d1108fe47
Potential workaround: