Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

IllegalBand exception #790

DeRooBert closed this issue 3 weeks ago

DeRooBert commented 3 weeks ago

While running jobs on waw3-1.openeo-vlcc-prod.vgt.vito.be, I sometimes (roughly 10% of jobs) get an error that band 9 does not exist (I guess viewAzimuthMean from the S2 tile). Can somebody have a look at where this problem originates? Example: j-2406043c04884110a4bad00e9a79aaf8. The error popped up after the upgrade of openEO on waw3-1.openeo-vlcc-prod (21/05); before the upgrade this error was not seen. This is urgent for the VLCC production.

bossie commented 3 weeks ago

The error popped up after the upgrade of openEO on waw3-1.openeo-vlcc-prod (21/05); before the upgrade this error was not seen.

Could you also provide the ID of a job that was successful but is failing now (i.e. one from before the upgrade)?

DeRooBert commented 3 weeks ago

I don't have that information at the moment; it would require me to rerun (random) jobs from before the upgrade and hope the error shows up. I'll give it a shot.

jdries commented 3 weeks ago

Note that this likely has something to do with the readperproduct code path. The only thing is that I already fixed something very similar; my guess is that there's still an uncovered edge case.

DeRooBert commented 3 weeks ago

Successful in April: j-240426c6ad0e4dc2aa2454a50194a9f8
Failed now: j-240605db8f004d198057de621673dac3

DeRooBert commented 3 weeks ago

@jdries Can this ticket be raised in priority? Could the following scenario also happen: band 9 is only partially found, and hence no error is thrown? (We suddenly see a decrease in successful products in a subsequent step of the processing.) Here is a job ID where this issue might be happening: j-2405306429094f73aa333edb6d1b0d68

jdries commented 3 weeks ago

Yes, trying to do that, but I had to move some other things out of the way first. It's pretty specific, which makes it a bit harder to assign to a random person.

jdries commented 3 weeks ago

It happens in the filter_bands process, before the fapar UDF. A large number of tasks do succeed, so it is not consistent, and band 9 is in fact just the last band, so the other geometry bands do seem to work fine. Hence we're looking for a case where load_collection, for whatever reason, decides not to return one of the bands.
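
For reference, a minimal sketch of where this sits in the workflow, using the openEO Python client; the backend URL and band list are illustrative (taken from the reduced reproduction further below), the actual job uses 10 bands and applies the fapar UDF afterwards.

```python
# Minimal sketch of the failing step, assuming the openEO Python client;
# band list and filter selection are illustrative, not the exact VLCC job.
import openeo

connection = openeo.connect("https://waw3-1.openeo-vlcc-prod.vgt.vito.be").authenticate_oidc()

cube = connection.load_collection(
    "SENTINEL2_L2A",
    bands=["B02", "viewAzimuthMean", "viewZenithMean", "sunAzimuthAngles", "sunZenithAngles"],
    temporal_extent=["2019-03-01", "2019-04-01"],
)

# The IllegalBand error surfaces here, in filter_bands, before the fapar UDF runs:
# load_collection occasionally returns a tile stack with one band missing.
angles = cube.filter_bands(["viewZenithMean", "sunZenithAngles"])
```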

jdries commented 3 weeks ago

We need to figure out exactly which band is missing. The band with index 9 is the last one in the list, so it looks like only one band got lost somewhere for specific tiles. I'm going to remove the filter_bands step and then put some logging in the fapar UDF to figure this out.

jdries commented 3 weeks ago

As expected, logging in the UDF shows that most chunks have the correct number of 10 bands. I now added an exception to indicate when the number is lower, and to print the cube.
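
The check can look roughly like the sketch below; this is not the actual fapar UDF, just a minimal illustration of the band-count guard using the standard openEO Python UDF signature (the expected count of 10 is taken from this job, the logger name is arbitrary).

```python
# Minimal sketch of the band-count guard described above, not the real fapar UDF.
import logging

from openeo.udf import XarrayDataCube

logger = logging.getLogger("fapar_udf")
EXPECTED_BANDS = 10  # number of bands this job loads

def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    array = cube.get_array()  # xarray.DataArray with a "bands" dimension
    labels = list(array.coords["bands"].values)
    logger.info("chunk has %d bands: %s", len(labels), labels)
    if len(labels) < EXPECTED_BANDS:
        # Fail loudly on the bad chunk instead of erroring later in filter_bands.
        raise ValueError(f"Expected {EXPECTED_BANDS} bands, got {len(labels)}: {labels}")
    return cube
```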

jdries commented 3 weeks ago

Printing the bad cube in the UDF didn't work out, because a built-in check before that also throws an error (something is wrong with the input band labels). Now exporting the cube to netCDF, hoping to see where the problem is situated.

jdries commented 3 weeks ago

I was able to narrow it down by downloading and inspecting the full 4.9 GB input data cube. It happens for March 2019: the viewZenithMean band seems to go missing for specific chunks.

When switching the band order, it turns out that the band right before the last one goes missing. When saving to file, it is reproducible for netCDF but apparently not for TIFF...

The reduced process graph used to reproduce it:

```json
{
  "process_graph": {
    "loadcollection1": {
      "process_id": "load_collection",
      "arguments": {
        "bands": [
          "B02",
          "viewAzimuthMean",
          "viewZenithMean",
          "sunAzimuthAngles",
          "sunZenithAngles"
        ],
        "featureflags": {
          "indexreduction": 2,
          "temporalresolution": "ByDay",
          "tilesize": 512
        },
        "id": "SENTINEL2_L2A",
        "properties": {
          "eo:cloud_cover": {
            "process_graph": {
              "lte1": {
                "arguments": {
                  "x": {
                    "from_parameter": "value"
                  },
                  "y": 95
                },
                "process_id": "lte",
                "result": true
              }
            }
          },
          "tileId": {
            "process_graph": {
              "eq1": {
                "arguments": {
                  "x": {
                    "from_parameter": "value"
                  },
                  "y": "30SX*"
                },
                "process_id": "eq",
                "result": true
              }
            }
          }
        },
        "spatial_extent": {
          "east": -0.862851468722435,
          "north": 37.85256342596736,
          "south": 37.73684527335698,
          "west": -1.0294362480744212
        },
        "temporal_extent": [
          "2019-03-01",
          "2019-04-01"
        ]
      }
    },
    "save1": {
      "process_id": "save_result",
      "arguments": {
        "data": {
          "from_node": "loadcollection1"
        },
        "format": "NETCDF"
      },
      "result": true
    }
  },
  "parameters": []
}
```
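
The exported result can then be checked locally for missing or empty bands; a minimal sketch, assuming the result file is named openEO.nc and the time dimension is called "t" (both may differ depending on the backend's netCDF layout):

```python
# Minimal sketch of checking the exported netCDF for missing/empty angle bands;
# file name "openEO.nc" and time dimension "t" are assumptions.
import xarray as xr

ds = xr.open_dataset("openEO.nc")
expected = ["B02", "viewAzimuthMean", "viewZenithMean", "sunAzimuthAngles", "sunZenithAngles"]

for t in ds["t"].values:
    missing = [
        name for name in expected
        if name not in ds or bool(ds[name].sel(t=t).isnull().all())
    ]
    if missing:
        print(f"{t}: missing or empty bands: {missing}")
```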

jdries commented 3 weeks ago

I may have found the issue: angle band names were not always unique. In the rare case where the sun and view azimuth angles were similar, this issue could happen.
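
To illustrate the failure mode (a hypothetical illustration, not the actual driver code): if a band's name is derived from its angle values, two angle bands with near-identical values can end up with the same name, and collecting them into a name-keyed mapping silently keeps only one of them.

```python
# Hypothetical illustration of the duplicate-name failure mode; not the actual
# driver code. Deriving a band name from the angle value alone can collide.
def derive_band_name(angle_degrees: float) -> str:
    return f"angle_{angle_degrees:.1f}"  # hypothetical naming scheme

# Sun and view azimuth happen to be (almost) identical for this product:
bands = [
    (derive_band_name(137.4), "sunAzimuthMean data"),
    (derive_band_name(137.4), "viewAzimuthMean data"),
]

by_name = dict(bands)  # the second entry overwrites the first
print(len(by_name))    # 1 instead of 2: one angle band silently lost
```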

jdries commented 3 weeks ago

@DeRooBert it seems fixed on staging and can be deployed on the VLCC clusters.

DeRooBert commented 3 weeks ago

OK, I'll ask Thomas to redeploy on the VLCC clusters.