Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

Batch job results don't contain any items when saved as netCDF #646

Closed VincentVerelst closed 3 months ago

VincentVerelst commented 5 months ago

When saving results of a batch job as netCDF, the resulting STAC collection doesn't contain any items and therefore cannot be loaded using load_stac.

For example the following batch job: j-2401190a5a2144868480ccba676ee9db.json

will result in the following STAC metadata: job-results.json

Using load_stac on these results will result in a NoDataAvailable Exception.
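
One way to confirm the symptom is to count the item links in the job results document. The following is a minimal sketch, assuming the results metadata follows the usual STAC convention where a collection points to its items through `links` entries with `"rel": "item"`. The function name and both sample documents are hypothetical; they are not the real job-results.json.

```python
from typing import Any, Dict, List


def item_links(stac_document: Dict[str, Any]) -> List[str]:
    """Return the hrefs of all links with rel == "item" in a STAC document."""
    return [
        link["href"]
        for link in stac_document.get("links", [])
        if link.get("rel") == "item" and "href" in link
    ]


# Hypothetical, trimmed-down results documents for illustration only.
without_items = {
    "type": "Collection",
    "links": [{"rel": "self", "href": "https://example.test/results"}],
}
with_items = {
    "type": "Collection",
    "links": [
        {"rel": "self", "href": "https://example.test/results"},
        {"rel": "item", "href": "https://example.test/results/items/openEO_0.nc"},
    ],
}

print(len(item_links(without_items)))
print(len(item_links(with_items)))
```

A collection for which this list is empty gives load_stac nothing to read, which would be consistent with the NoDataAvailable exception described above.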

bossie commented 5 months ago

This is at least part of the reason:

https://github.com/Open-EO/openeo-python-driver/blob/6e6f0f0fd462f3bd1aad491a8b022c3fdd8f1de7/openeo_driver/views.py#L1102C60-L1102C97

netCDF assets with a time dimension could be problematic.

VincentVerelst commented 5 months ago

In the batch job from the example above, the netCDFs don't have a time dimension, FYI.

jdries commented 5 months ago

The problem I see is that extraction jobs generate many netCDFs in one job, while this method: https://github.com/Open-EO/openeo-python-driver/blob/1d86962102e686de71202a90838c659a53d33170/openeo_driver/views.py#L1267C9-L1267C29

will assume that the job bbox is also the item bbox. I think we need an approach that generates the item JSON as part of the batch job?
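
To illustrate the idea of per-asset items, here is a rough sketch of building one STAC Item per netCDF asset, each with its own bbox instead of the job-wide one. This is not the driver's implementation; `build_item`, `bbox_to_polygon`, the asset hrefs and the bbox values are all hypothetical, and the media type `application/x-netcdf` is an assumption.

```python
from typing import Any, Dict, List


def bbox_to_polygon(bbox: List[float]) -> Dict[str, Any]:
    """Turn [west, south, east, north] into a closed GeoJSON polygon."""
    west, south, east, north = bbox
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


def build_item(item_id: str, asset_href: str, bbox: List[float], datetime_utc: str) -> Dict[str, Any]:
    """Build a minimal STAC Item whose bbox describes this asset only."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": bbox_to_polygon(bbox),
        "bbox": bbox,
        "properties": {"datetime": datetime_utc},
        "links": [],
        "assets": {
            item_id: {"href": asset_href, "type": "application/x-netcdf", "roles": ["data"]}
        },
    }


# Hypothetical extraction job that wrote two netCDF files for two sample locations.
assets = [
    ("openEO_0.nc", [5.849, 50.741, 5.869, 50.754]),
    ("openEO_1.nc", [4.300, 51.100, 4.320, 51.113]),
]
items = [
    build_item(name, f"https://example.test/assets/{name}", bbox, "2024-01-19T00:00:00Z")
    for name, bbox in assets
]

for item in items:
    print(item["id"], item["bbox"])
```

Generating these documents inside the batch job, where the extent of each output file is known, avoids having to guess the item bbox from the job bbox afterwards.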

jdries commented 5 months ago

Using the datacube extension in items also seems relevant in the case of netCDF: https://github.com/stac-extensions/datacube/blob/main/examples/item.json
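
As a rough sketch of what that could look like, the snippet below builds `cube:dimensions` and `cube:variables` properties for a single-variable netCDF asset. The helper names are hypothetical. The numbers are taken from the gdalinfo output of the openEO_0.nc asset further down in this thread (EPSG:32631, 10 m pixels, variable B04), and using the raster corner coordinates as dimension extents is a simplification. A real item would also have to list the extension schema under `stac_extensions`.

```python
from typing import Any, Dict, List


def spatial_dimension(axis: str, low: float, high: float, step: float, epsg: int) -> Dict[str, Any]:
    """Describe one horizontal spatial dimension of the cube."""
    return {
        "type": "spatial",
        "axis": axis,
        "extent": [low, high],
        "step": step,
        "reference_system": epsg,
    }


def cube_properties(
    x_extent: List[float],
    y_extent: List[float],
    step: float,
    epsg: int,
    variables: List[str],
) -> Dict[str, Any]:
    """Build the datacube-related item properties for a 2D netCDF asset."""
    return {
        "cube:dimensions": {
            "x": spatial_dimension("x", x_extent[0], x_extent[1], step, epsg),
            "y": spatial_dimension("y", y_extent[0], y_extent[1], step, epsg),
        },
        "cube:variables": {
            name: {"dimensions": ["y", "x"], "type": "data"} for name in variables
        },
    }


properties = cube_properties(
    x_extent=[701040.0, 702330.0],
    y_extent=[5625050.0, 5626340.0],
    step=10.0,
    epsg=32631,
    variables=["B04"],
)
print(sorted(properties))
```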

bossie commented 5 months ago

For reference, GeoTIFF equivalent seems to have been implemented in EP-4118.

bossie commented 5 months ago

Related (the bands part): https://github.com/Open-EO/openeo-geotrellis-extensions/issues/259

bossie commented 4 months ago

Not yet available on openeo-dev because the integration tests fail (for unrelated reasons). Fails on CDSE dev/staging because of https://github.com/eu-cdse/openeo-cdse-infra/issues/55.

bossie commented 4 months ago

load_stac of those results on openeo-dev results in a GDAL error:

Traceback (most recent call last):
  File "batch_job.py", line 1278, in <module>
    main(sys.argv)
  File "batch_job.py", line 1013, in main
    run_driver()
  File "batch_job.py", line 984, in run_driver
    run_job(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 54, in memory_logging_wrapper
    return function(*args, **kwargs)
  File "batch_job.py", line 1077, in run_job
    result = ProcessGraphDeserializer.evaluate(process_graph, env=env, do_dry_run=tracer)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 373, in evaluate
    result = convert_node(result_node, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 398, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1558, in apply_process
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1558, in <dictcomp>
    args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 412, in convert_node
    return convert_node(processGraph['node'], env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 398, in convert_node
    process_result = apply_process(process_id=process_id, args=processGraph.get('arguments', {}),
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1590, in apply_process
    return process_function(args=ProcessArgs(args, process_id=process_id), env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 2199, in load_stac
    return env.backend_implementation.load_stac(url=url, load_params=load_params, env=env)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1091, in load_stac
    pyramid = pyramid_factory.datacube_seq(projected_polygons, from_date.isoformat(), to_date.isoformat(),
  File "/opt/spark3_4_0/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/opt/spark3_4_0/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1677.datacube_seq.
: java.io.IOException: Exception while determining data type of collection https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/7806fa7bc01110e93a16c7d65e599c21?expires=1708522345 and item NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04. Detailed message: Unable to determine NoData value. GDAL Exception Code: 4
    at org.openeo.geotrellis.layers.FileLayerProvider.determineCelltype(FileLayerProvider.scala:728)
    at org.openeo.geotrellis.layers.FileLayerProvider.readKeysToRasterSources(FileLayerProvider.scala:758)
    at org.openeo.geotrellis.layers.FileLayerProvider.readMultibandTileLayer(FileLayerProvider.scala:957)
    at org.openeo.geotrellis.file.PyramidFactory.datacube(PyramidFactory.scala:128)
    at org.openeo.geotrellis.file.PyramidFactory.datacube_seq(PyramidFactory.scala:91)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: geotrellis.raster.gdal.MalformedDataTypeException: Unable to determine NoData value. GDAL Exception Code: 4
    at geotrellis.raster.gdal.GDALDataset$.$anonfun$noDataValue$1(GDALDataset.scala:313)
    at geotrellis.raster.gdal.GDALDataset$.$anonfun$noDataValue$1$adapted(GDALDataset.scala:310)
    at geotrellis.raster.gdal.GDALDataset$.errorHandler$extension(GDALDataset.scala:422)
    at geotrellis.raster.gdal.GDALDataset$.noDataValue$extension1(GDALDataset.scala:310)
    at geotrellis.raster.gdal.GDALDataset$.cellType$extension1(GDALDataset.scala:366)
    at geotrellis.raster.gdal.GDALDataset$.cellType$extension0(GDALDataset.scala:361)
    at geotrellis.raster.gdal.GDALRasterSource.$anonfun$cellType$1(GDALRasterSource.scala:91)
    at scala.Option.getOrElse(Option.scala:189)
    at geotrellis.raster.gdal.GDALRasterSource.cellType$lzycompute(GDALRasterSource.scala:91)
    at geotrellis.raster.gdal.GDALRasterSource.cellType(GDALRasterSource.scala:91)
    at org.openeo.geotrellis.layers.BandCompositeRasterSource.$anonfun$cellType$1(FileLayerProvider.scala:92)
    at cats.data.NonEmptyList.map(NonEmptyList.scala:87)
    at org.openeo.geotrellis.layers.BandCompositeRasterSource.cellType(FileLayerProvider.scala:92)
    at org.openeo.geotrellis.layers.FileLayerProvider.determineCelltype(FileLayerProvider.scala:722)
    ... 16 more

A gdalinfo as well as a GDALRasterSource of that asset URL work on my machine but not from the web app driver on openeo-dev. To investigate.

bossie commented 4 months ago

gdalinfo with debug output in driver container:

bash-4.4$ CPL_DEBUG=ON gdalinfo NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04
GDAL: CPLIsUserFaultMappingSupported(): syscall(__NR_userfaultfd) failed: insufficient permission. add CAP_SYS_PTRACE capability, or set /proc/sys/vm/unprivileged_userfaultfd to 1
HTTP: libcurl/7.61.1 OpenSSL/1.1.1k zlib/1.2.11 nghttp2/1.33.0
VSICURL: GetFileSize(https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785)=45482  response_code=200
VSICURL: Downloading 0-16383 (https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785)...
VSICURL: Got response_code=206
ERROR 4: NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04: No such file or directory
gdalinfo failed - unable to open 'NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04'.

gdalinfo with debug output on my machine:

bossie@rastapopoulos:~/opt/gdal-3.7.0/installed/bin$ CPL_DEBUG=ON ./gdalinfo NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04
./gdalinfo: error while loading shared libraries: libgdal.so.33: cannot open shared object file: No such file or directory
bossie@rastapopoulos:~/opt/gdal-3.7.0/installed/bin$ LD_LIBRARY_PATH=$(readlink -f ../lib) CPL_DEBUG=ON ./gdalinfo NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04
HTTP: libcurl/7.81.0 GnuTLS/3.7.3 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.16
VSICURL: GetFileSize(https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785)=45482  response_code=200
VSICURL: Downloading 0-16383 (https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785)...
VSICURL: Got response_code=206
GDAL_netCDF: driver detected file type=3, libnetcdf detected type=4
GDAL_netCDF: setting file type to 4, was 3
GDAL_netCDF: var_count = 5
GDAL_netCDF: 
=====
SetProjectionFromVar( 65536, 3)
GDAL_netCDF: got grid_mapping crs
GDAL_netCDF: setting WKT from GDAL
GDAL_netCDF: bIsGdalFile=0 bIsGdalCfFile=0 bSwitchedXY=0 bBottomUp=1
GDAL_netCDF: xdim: 129 dfSpacingBegin: 10.000000 dfSpacingMiddle: 10.000000 dfSpacingLast: 10.000000
GDAL_netCDF: ydim: 129 dfSpacingBegin: -10.000000 dfSpacingMiddle: -10.000000 dfSpacingLast: -10.000000
GDAL_netCDF: set bBottomUp = 0 from Y axis
GDAL_netCDF: bGotGeogCS=0 bGotCfSRS=0 bGotCfGT=1 bGotCfWktSRS=0 bGotGdalSRS=1 bGotGdalGT=0
GDAL_netCDF: netcdf type=5 gdal type=6 signedByte=1
GDAL: GDALOpen(NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04, this=0x55ae81ec3170) succeeds as netCDF.
Driver: netCDF/Network Common Data Format
GDAL: GDALDefaultOverviews::OverviewScan()
Files: /vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785
Size is 129, 129
Coordinate System is:
PROJCRS["WGS 84 / UTM zone 31N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]]],
    CONVERSION["UTM zone 31N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Longitude of natural origin",3,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["m",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["m",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting",east,
            ORDER[1],
            LENGTHUNIT["m",1]],
        AXIS["northing",north,
            ORDER[2],
            LENGTHUNIT["m",1]],
    ID["EPSG",32631]]
Data axis to CRS axis mapping: 1,2
Origin = (701040.000000000000000,5626340.000000000000000)
Pixel Size = (10.000000000000000,-10.000000000000000)
Metadata:
  B04#grid_mapping=crs
  B04#long_name=B04
  B04#units=
  B04#_FillValue=nan
  crs#crs_wkt=PROJCS["WGS 84 / UTM zone 31N", GEOGCS["WGS 84", DATUM["World Geodetic System 1984", SPHEROID["WGS 84", 6378137.0, 298.257223563, AUTHORITY["EPSG","7030"]], AUTHORITY["EPSG","6326"]], PRIMEM["Greenwich", 0.0, AUTHORITY["EPSG","8901"]], UNIT["degree", 0.017453292519943295], AXIS["Geodetic longitude", EAST], AXIS["Geodetic latitude", NORTH], AUTHORITY["EPSG","4326"]], PROJECTION["Transverse_Mercator", AUTHORITY["EPSG","9807"]], PARAMETER["central_meridian", 3.0], PARAMETER["latitude_of_origin", 0.0], PARAMETER["scale_factor", 0.9996], PARAMETER["false_easting", 500000.0], PARAMETER["false_northing", 0.0], UNIT["m", 1.0], AXIS["Easting", EAST], AXIS["Northing", NORTH], AUTHORITY["EPSG","32631"]]
  crs#spatial_ref=PROJCS["WGS 84 / UTM zone 31N", GEOGCS["WGS 84", DATUM["World Geodetic System 1984", SPHEROID["WGS 84", 6378137.0, 298.257223563, AUTHORITY["EPSG","7030"]], AUTHORITY["EPSG","6326"]], PRIMEM["Greenwich", 0.0, AUTHORITY["EPSG","8901"]], UNIT["degree", 0.017453292519943295], AXIS["Geodetic longitude", EAST], AXIS["Geodetic latitude", NORTH], AUTHORITY["EPSG","4326"]], PROJECTION["Transverse_Mercator", AUTHORITY["EPSG","9807"]], PARAMETER["central_meridian", 3.0], PARAMETER["latitude_of_origin", 0.0], PARAMETER["scale_factor", 0.9996], PARAMETER["false_easting", 500000.0], PARAMETER["false_northing", 0.0], UNIT["m", 1.0], AXIS["Easting", EAST], AXIS["Northing", NORTH], AUTHORITY["EPSG","32631"]]
  NC_GLOBAL#Conventions=CF-1.9
  NC_GLOBAL#description=
  NC_GLOBAL#institution=openEO platform - Geotrellis backend: 0.27.0a1
  NC_GLOBAL#title=
  x#long_name=x coordinate of projection
  x#standard_name=projection_x_coordinate
  x#units=m
  y#long_name=y coordinate of projection
  y#standard_name=projection_y_coordinate
  y#units=m
Corner Coordinates:
Upper Left  (  701040.000, 5626340.000) (  5d51' 0.89"E, 50d45'14.29"N)
Lower Left  (  701040.000, 5625050.000) (  5d50'58.35"E, 50d44'32.58"N)
Upper Right (  702330.000, 5626340.000) (  5d52' 6.64"E, 50d45'12.68"N)
Lower Right (  702330.000, 5625050.000) (  5d52' 4.09"E, 50d44'30.97"N)
Center      (  701685.000, 5625695.000) (  5d51'32.49"E, 50d44'52.63"N)
Band 1 Block=129x129 Type=Float32, ColorInterp=Undefined
  NoData Value=nan
  Metadata:
    grid_mapping=crs
    long_name=B04
    NETCDF_VARNAME=B04
    units=
    _FillValue=nan
GDAL: GDALClose(NETCDF:/vsicurl/https://openeo-dev.vito.be/openeo/1.1/jobs/j-2402139ee06e4f088f2cec0cc911339e/results/assets/N2Q1MjMzODEzNzRiNjJlNmYyYWFkMWYyZjlmYjZlZGRmNjI0ZDM4MmE4ZjcxZGI2ZGNmNTc4OGUzYWFlMGFmM0BlZ2kuZXU%3D/821434835ac34118b66c8da71aa04003/openEO_0.nc?expires=1708522785:B04, this=0x55ae81ec3170)
GDAL: In GDALDestroy - unloading GDAL shared library.

bossie commented 4 months ago

The netCDF driver doesn't support Virtual IO (lacks the v flag):

bash-4.4$ gdalinfo --formats | grep -i netcdf
  netCDF -raster,multidimensional raster,vector- (rw+s): Network Common Data Format

On my machine:

bossie@rastapopoulos:~/opt/gdal-3.7.0/installed/bin$ ./gdalinfo --formats | grep -i netcdf
  netCDF -raster,multidimensional raster,vector- (rw+vs): Network Common Data Format

This explains why it's able to read those files from disk just fine but not with /vsicurl.

For reference, gdalinfo --format netCDF should also report:

Supports: Virtual IO - eg. /vsimem/
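
The comparison above can be automated when checking several images or hosts. This is a minimal sketch that parses one line of `gdalinfo --formats` output and looks for the `v` flag; the function names are hypothetical and the two input lines are copied from the outputs above.

```python
import re
from typing import Optional

# Matches the capability flags in parentheses, e.g. "(rw+vs):".
_FLAGS = re.compile(r"\(([a-z+]+)\):")


def driver_flags(formats_line: str) -> Optional[str]:
    """Extract the capability flags from one `gdalinfo --formats` line."""
    match = _FLAGS.search(formats_line)
    return match.group(1) if match else None


def supports_virtual_io(formats_line: str) -> bool:
    """True if the driver line carries the `v` (Virtual IO) flag."""
    flags = driver_flags(formats_line)
    return flags is not None and "v" in flags


driver_container = "  netCDF -raster,multidimensional raster,vector- (rw+s): Network Common Data Format"
local_machine = "  netCDF -raster,multidimensional raster,vector- (rw+vs): Network Common Data Format"

print(supports_virtual_io(driver_container))
print(supports_virtual_io(local_machine))
```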

bossie commented 4 months ago

Bumping into this:

Since GDAL 2.4, and with Linux kernel >=4.3 and libnetcdf >=4.5, read operations on /vsi file systems are supported using the userfaultfd Linux system call. If running from a container, that system call may be unavailable by default. For example with Docker, --security-opt seccomp=unconfined might be needed.

Passing that flag to docker run indeed fixes it.
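
The CPL_DEBUG output from the driver container already hinted at this: it contains a `syscall(__NR_userfaultfd) failed` message that is absent from the output on my machine. A small sketch for spotting that line in captured debug output follows; the function name is hypothetical and the sample lines are shortened copies of the outputs above.

```python
# Marker taken from the GDAL debug line seen in the driver container.
MARKER = "syscall(__NR_userfaultfd) failed"


def userfaultfd_blocked(debug_output: str) -> bool:
    """True if the GDAL debug output reports a failed userfaultfd syscall."""
    return any(MARKER in line for line in debug_output.splitlines())


container_output = "\n".join(
    [
        "GDAL: CPLIsUserFaultMappingSupported(): syscall(__NR_userfaultfd) failed: insufficient permission.",
        "HTTP: libcurl/7.61.1 OpenSSL/1.1.1k zlib/1.2.11 nghttp2/1.33.0",
        "VSICURL: Got response_code=206",
    ]
)
local_output = "\n".join(
    [
        "VSICURL: Got response_code=206",
        "GDAL_netCDF: driver detected file type=3, libnetcdf detected type=4",
    ]
)

print(userfaultfd_blocked(container_output))
print(userfaultfd_blocked(local_output))
```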

bossie commented 4 months ago

A more fine-grained way to enable the userfaultfd system call is described here and seems to work: https://github.com/LLNL/umap/blob/develop/README.md#example-running-the-umap-container-with-a-seccomp-whitelist

I'm not sure what the consequences are; is this an option, @jdries?
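
For illustration, the whitelist approach amounts to taking an existing seccomp profile and adding userfaultfd to the allowed system calls. Below is a minimal sketch, assuming the Docker seccomp profile JSON layout (a `syscalls` list of rules with `names` and `action`). The function name and the tiny sample profile are hypothetical, and this is not the exact procedure from the linked README.

```python
import copy
import json
from typing import Any, Dict


def allow_syscall(profile: Dict[str, Any], name: str = "userfaultfd") -> Dict[str, Any]:
    """Return a copy of the profile in which `name` is explicitly allowed."""
    patched = copy.deepcopy(profile)
    rules = patched.setdefault("syscalls", [])
    for rule in rules:
        if rule.get("action") == "SCMP_ACT_ALLOW" and name in rule.get("names", []):
            return patched  # already allowed, nothing to add
    rules.append({"names": [name], "action": "SCMP_ACT_ALLOW"})
    return patched


# Hypothetical, heavily trimmed profile; a real one lists hundreds of syscalls.
base_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}

patched_profile = allow_syscall(base_profile)
print(json.dumps(patched_profile, indent=2))
```

The resulting JSON could then be written to a file and passed to docker run with `--security-opt seccomp=<file>`, which is narrower than `seccomp=unconfined`.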

bossie commented 4 months ago

I learned that k8s does allow these system calls by default and indeed, load_stac is able to read netCDF assets and the result can e.g. be saved as a GeoTIFF. Unfortunately, the load_stac-batch job crashes upon completion and is marked as error:

{"message": "Writing results to object storage", "levelname": "INFO", "name": "openeogeotrellis.deploy.batch_job", "created": 1708000014.1691525, "filename": "batch_job.py", "lineno": 1215, "process": 70, "job_id": "j-24021573df2347b4a1c71931f507ecd1", "user_id": "df7ea45d-ecc4-453f-8af9-de8cfb1058b1"}
{"message": "batch_job.py main os.getpid()=70: end 2024-02-15 12:26:56.081776, elapsed 0:00:46.435952", "levelname": "INFO", "name": "openeogeotrellis.deploy.batch_job", "created": 1708000016.0818608, "filename": "util.py", "lineno": 347, "process": 70, "job_id": "j-24021573df2347b4a1c71931f507ecd1", "user_id": "df7ea45d-ecc4-453f-8af9-de8cfb1058b1"}
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 140338640241344:
  #000: ../../src/H5T.c line 1754 in H5Tclose(): not a datatype
    major: Invalid arguments to routine
    minor: Inappropriate type
[1 of 1000] FAILURE(3) CPLE_AppDefined(1) "Application defined error." netcdf error #-101 : NetCDF: HDF error .
at (/home/jenkins/rpmbuild/BUILD/gdal-3.7.0-fedora/frmts/netcdf/netcdfdataset.cpp,Close,2964)

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fa273752844, pid=15, tid=15
#
# JRE version: OpenJDK Runtime Environment 18.9 (11.0.14+9) (build 11.0.14+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM 18.9 (11.0.14+9-LTS, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libgdalwarp_bindings.so+0x4a844]  std::_Rb_tree<int, std::pair<int const, std::tuple<int, std::chrono::duration<long, std::ratio<1l, 1000l> > > >, std::_Select1st<std::pair<int const, std::tuple<int, std::chrono::duration<long, std::ratio<1l, 1000l> > > > >, std::less<int>, std::allocator<std::pair<int const, std::tuple<int, std::chrono::duration<long, std::ratio<1l, 1000l> > > > > >::_M_begin()+0xc
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /opt/spark/work-dir/core.15)
#
# An error report file with more information is saved as:
# /opt/spark/work-dir/hs_err_pid15.log
#
# If you would like to submit a bug report, please visit:
#   https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%208&component=java-11-openjdk
#

It's not possible to download the result assets in the normal way (not even with ?partial=true) but they do end up in S3 and look as expected.

Sync requests work fine so there's that.

bossie commented 4 months ago

A more fine grained way to enable the userfaultfd system call is described here and seems to work: https://github.com/LLNL/umap/blob/develop/README.md#example-running-the-umap-container-with-a-seccomp-whitelist

I considered enabling this just for batch jobs, but I can't find a way to pass this --security-opt to spark-submit either, so yeah.

bossie commented 4 months ago

To summarize, at this point:

jdries commented 4 months ago

Possible solutions:

bossie commented 4 months ago

Internal ref: GDD-3173

bossie commented 4 months ago

  • On K8S: call GDALWarp#deinit before batch job end to clean up

Confirmed: works!
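
The general shape of that fix is a cleanup step that always runs before the batch job process exits. The sketch below shows only this pattern with a hypothetical `run_with_cleanup` helper and stand-in callables. How the real driver reaches GDALWarp#deinit (presumably through the JVM gateway) is an assumption and is not shown here.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_cleanup(run_job: Callable[[], T], cleanup: Callable[[], None]) -> T:
    """Run the job and always call cleanup afterwards, even on failure."""
    try:
        return run_job()
    finally:
        # In the real batch job this is where the GDAL bindings would be
        # deinitialized (assumption: a call equivalent to GDALWarp#deinit).
        cleanup()


events: List[str] = []


def fake_job() -> str:
    events.append("job")
    return "done"


def fake_deinit() -> None:
    events.append("deinit")


result = run_with_cleanup(fake_job, fake_deinit)
print(result, events)
```

Putting the call in a `finally` block means the native resources are released on both the success path and the error path before the interpreter and JVM shut down.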

bossie commented 3 months ago

Resolution of GDD-3173:

I won't be able to get the seccomp profile working on our current Hadoop cluster due to the outdated kernel on CentOS 7. I've already implemented the change in the new cluster, but that one is still under development.