Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

merge_cubes: invalid band merging #508

Open soxofaan opened 12 months ago

soxofaan commented 12 months ago

Related to https://github.com/Open-EO/openeo-geopyspark-driver/issues/507 (and https://github.com/Open-EO/openeo-aggregator/issues/115):

merge_cubes of two cubes with different bands is not handled correctly:

temporal_extent1 = ["2023-08-20", "2023-09-01"]
spatial_extent1 = {"west": 3.00, "south": 51.00, "east": 3.10, "north": 51.10}
bands1 = ["B02"]

cube1 = connection.load_collection(
    "TERRASCOPE_S2_TOC_V2",
    temporal_extent=temporal_extent1,
    spatial_extent=spatial_extent1,
    bands=bands1,
)

bands2 = ["B03", "B04"]
cube2 = connection.load_collection(
    "TERRASCOPE_S2_TOC_V2",
    temporal_extent=temporal_extent1,
    spatial_extent=spatial_extent1,
    bands=bands2
)

cube = cube1.merge_cubes(cube2, overlap_resolver="max")
res = cube.save_result(format="netCDF")

process graph:

{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "bands": ["B02"],
        "id": "TERRASCOPE_S2_TOC_V2",
        "spatial_extent": {"west": 3.0, "south": 51.0, "east": 3.1, "north": 51.1},
        "temporal_extent": ["2023-08-20", "2023-09-01"]
      }
    },
    "loadcollection2": {
      "process_id": "load_collection",
      "arguments": {
        "bands": ["B03", "B04"],
        "id": "TERRASCOPE_S2_TOC_V2",
        "spatial_extent": {"west": 3.0, "south": 51.0, "east": 3.1, "north": 51.1},
        "temporal_extent": ["2023-08-20", "2023-09-01"]
      }
    },
    "mergecubes1": {
      "process_id": "merge_cubes",
      "arguments": {
        "cube1": {"from_node": "loadcollection1"},
        "cube2": {"from_node": "loadcollection2"},
        "overlap_resolver": {
          "process_graph": {
            "max1": {
              "process_id": "max",
              "arguments": {"data": [{"from_parameter": "x"}, {"from_parameter": "y"}]},
              "result": true
            }
          }
        }
      }
    },
    "saveresult1": {
      "process_id": "save_result",
      "arguments": {
        "data": {"from_node": "mergecubes1"},
        "format": "netCDF",
        "options": {}
      },
      "result": true
    }
  }
}

Running this as a batch job fails with:

java.lang.IllegalArgumentException: Merging cubes with an overlap resolver is only supported when band counts are the same. I got: 1 and 2
at org.openeo.geotrellis.OpenEOProcesses.$anonfun$resolve_merge_overlap$2(OpenEOProcesses.scala:801)
at org.apache.spark.rdd.PairRDDFunctions.$anonfun$mapValues$3(PairRDDFunctions.scala:752)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)

What should happen instead is a merged cube with the three bands ["B02", "B03", "B04"]; the driver should not try to combine ["B02"] values with ["B03", "B04"] values.
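For reference, the openEO merge_cubes specification says that bands present in only one cube are passed through and the overlap resolver is applied only to band labels present in both cubes. A minimal sketch of that expected semantics, with plain scalars standing in for per-band rasters (this is an illustration, not the driver's actual Geotrellis implementation):

```python
def merge_cubes(cube1, cube2, overlap_resolver=None):
    """Merge two band->value mappings following openEO merge_cubes semantics:
    disjoint bands are concatenated; a band present in both cubes requires
    an overlap resolver to combine the two values."""
    merged = dict(cube1)
    for band, value in cube2.items():
        if band in merged:
            if overlap_resolver is None:
                raise ValueError(f"Band {band!r} overlaps; an overlap_resolver is required")
            merged[band] = overlap_resolver(merged[band], value)
        else:
            merged[band] = value
    return merged

# Disjoint band sets: the result is simply a three-band cube,
# no per-pixel combining of B02 with B03/B04 values.
cube = merge_cubes({"B02": 4}, {"B03": 5, "B04": 6}, overlap_resolver=max)
print(sorted(cube))  # ['B02', 'B03', 'B04']
```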

jdries commented 12 months ago

The workaround here is to not specify an overlap resolver.

soxofaan commented 12 months ago

Another use case and symptom: the same number of bands, but different band labels:

bands1 = ["B02"]
cube1 = connection.load_collection(
    ...
    bands=bands1,
)

bands2 = ["B03"]
cube2 = connection.load_collection(
    ...
    bands=bands2
)

This fails with the opaque user-facing error

YARN application status reports error diagnostics: User application exited with status 139

And I have to manually look into the YARN logs to find:

malloc_consolidate(): invalid chunk size
Fatal Python error: Aborted
...
Current thread 0x00007f71292ca100 (most recent call first):
  File "/usr/lib64/python3.8/site-packages/osgeo/gdal.py", line 5896 in InfoInternal
  File "/usr/lib64/python3.8/site-packages/osgeo/gdal.py", line 413 in Info
  File "batch_job.py", line 647 in read_gdal_info
  File "batch_job.py", line 831 in _process_gdalinfo_for_netcdf_subdatasets
  File "batch_job.py", line 680 in parse_gdal_raster_metadata
  File "batch_job.py", line 617 in read_gdal_raster_metadata
  File "batch_job.py", line 519 in _extract_asset_raster_metadata
  File "batch_job.py", line 387 in _extract_asset_metadata
  File "batch_job.py", line 260 in _assemble_result_metadata
  File "batch_job.py", line 1208 in run_job
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 52 in memory_logging_wrapper
  File "batch_job.py", line 999 in run_driver
  File "batch_job.py", line 1028 in main
  File "batch_job.py", line 1291 in <module>

soxofaan commented 12 months ago

Indeed, dropping the overlap resolver from cube1.merge_cubes(cube2) makes both use cases above work.

soxofaan commented 12 months ago

There is, however, no workaround yet for the use case with actual band overlap:

bands1 = ["B02", "B03"]
cube1 = connection.load_collection(
    ...
    bands=bands1,
)

bands2 = ["B03", "B04"]
cube2 = connection.load_collection(
    ...
    bands=bands2
)

Without an overlap resolver specified, it fails with a complaint that one is required. With an overlap resolver, it fails without a usable user-facing error message. In the YARN logs I find:

corrupted size vs. prev_size
Fatal Python error: Aborted
...
Current thread 0x00007f3ab5d29100 (most recent call first):
  File "/usr/lib64/python3.8/site-packages/osgeo/gdal.py", line 5896 in InfoInternal
  File "/usr/lib64/python3.8/site-packages/osgeo/gdal.py", line 413 in Info
  File "batch_job.py", line 647 in read_gdal_info
  File "batch_job.py", line 831 in _process_gdalinfo_for_netcdf_subdatasets
  File "batch_job.py", line 680 in parse_gdal_raster_metadata
  File "batch_job.py", line 617 in read_gdal_raster_metadata
  File "batch_job.py", line 519 in _extract_asset_raster_metadata
  File "batch_job.py", line 387 in _extract_asset_metadata
  File "batch_job.py", line 260 in _assemble_result_metadata
  File "batch_job.py", line 1208 in run_job
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 52 in memory_logging_wrapper
  File "batch_job.py", line 999 in run_driver
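For this partially overlapping case, the behavior the merge_cubes spec calls for is: the merged cube gets the union of band labels, and the resolver combines values only for the shared band B03. A hypothetical sketch with the band sets from this use case (scalars standing in for rasters, not the driver's implementation):

```python
def merge_band_cubes(cube1, cube2, overlap_resolver):
    # Union of band labels; the resolver combines values only where a band
    # exists in both cubes, all other bands are passed through unchanged.
    merged = dict(cube1)
    for band, value in cube2.items():
        merged[band] = overlap_resolver(merged[band], value) if band in merged else value
    return merged

cube1 = {"B02": 2, "B03": 3}
cube2 = {"B03": 7, "B04": 4}
# Only B03 goes through max(); B02 and B04 are copied as-is.
print(merge_band_cubes(cube1, cube2, max))  # {'B02': 2, 'B03': 7, 'B04': 4}
```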
jdries commented 12 months ago

That last error is related to result metadata assembly. So either it is not the real error, or the workflow itself does succeed and it is merely the new metadata generation that fails.