Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0
26 stars 4 forks source link

/result regularly returns empty 200 OK responses on dev with basic auth #418

Closed bossie closed 1 year ago

bossie commented 1 year ago

This request towards dev will return a 200 OK response with an empty body about half the time; this only happens if you use basic auth, with OIDC auth it behaves normally.

connection = openeo.connect("https://openeo-dev.vito.be").authenticate_basic("???", "!!!")

datacube = connection.load_collection(
    "SENTINEL1_GRD",
    spatial_extent={"west": 16.06, "south": 48.06, "east": 16.07, "north": 48.07},
    temporal_extent=["2017-03-01", "2017-03-10"],
    bands=["VV", "VH"],
    properties={"polarization": lambda p: p == "DV"}
).sar_backscatter()

datacube.download("/tmp/backscatter.tiff")
bossie commented 1 year ago

D. De Munck reported this issue as well:

Deze unit test werkt meestal, maar vrijdagavond bv. een hele tijd consistent niet en nu af en toe niet . https://.../projects/GEOM/repos/gmuav-pc/browse/uav-pc-scripts/terrain_follow/unit_test.py?at=refs%2Fheads%2Fdevelopment

soxofaan commented 1 year ago

It looks like the behavior is nicely alternating between success and failure:

import openeo
con = openeo.connect(
    url="openeo-dev.vito.be"
    # url="http://openeo-dev.vgt.vito.be"
    # url="openeo.vito.be"
)
con.authenticate_basic(...)
pg = {"add": {"process_id": "add", "arguments": {"x": 3, "y": 5}, "result": True}}

for i in range(20):
    try:
        res = con.execute(pg)
        print(i, f"great success: {res}")
    except Exception as e:
        print(i, f"fail: {e}")

->

0 great success: 8
1 fail: Expecting value: line 1 column 1 (char 0)
2 great success: 8
3 fail: Expecting value: line 1 column 1 (char 0)
4 great success: 8
5 fail: Expecting value: line 1 column 1 (char 0)
6 great success: 8
7 fail: Expecting value: line 1 column 1 (char 0)
8 great success: 8
9 fail: Expecting value: line 1 column 1 (char 0)
10 great success: 8
11 fail: Expecting value: line 1 column 1 (char 0)
12 great success: 8
13 fail: Expecting value: line 1 column 1 (char 0)
14 great success: 8
15 fail: Expecting value: line 1 column 1 (char 0)
16 great success: 8
17 fail: Expecting value: line 1 column 1 (char 0)
18 great success: 8
19 fail: Expecting value: line 1 column 1 (char 0)

looks like a bad entry in some round-robin load balancer

bossie commented 1 year ago

Probably related: creating jobs on dev works fine but starting them returns a 200 OK (without actually doing anything) instead of an expected 202 Accepted; the OpenEO Python client raises an error because of this unexpected response code. Again, there is a problem with basic auth as OIDC auth works fine.

Run e.g. tests.test_integration.test_batch_job_execute_batch against https://openeo-dev.vito.be to reproduce.

The current theory is that a Traefik plugin fails to contact an API and returns a 200 OK without actually forwarding the request to OpenEO and returning its response.

bossie commented 1 year ago

Traefik dashboards (page /OpenEO in Confluence): ssh -L 8080:localhost:8080 traefik01..., then visit http://localhost:8080/dashboard/

bossie commented 1 year ago

Below are /result requests with basic auth towards individual Traefik nodes, their responses and the corresponding entries in their access logs, including the router Traefik picked to service the request.

bossie@rastapopoulos:~$ curl --location 'http://traefik01-dev/openeo/1.0/result' --header 'Content-Type: application/json' --header 'Host: openeo-dev.vito.be' --header 'Authorization: Bearer basic//...' --data '{
    "process": {
        "process_graph": {
            "add": {
                "process_id": "add",
                "arguments": {
                    "x": 3,
                    "y": 5
                },
                "result": true
            }
        }
    }
}'

172.24.237.254 - - [11/May/2023:19:53:33 +0000] "POST /openeo/1.0/result HTTP/1.1" 0 0 "-" "-" 3888 "openeo-dev-results@zookeeper" "-" 0ms
bossie@rastapopoulos:~$ curl --location 'http://traefik02-dev/openeo/1.0/result' \
--header 'Content-Type: application/json' \
--header 'Host: openeo-dev.vito.be' \
--header 'Authorization: Bearer basic//...' \
--data '{
    "process": {
        "process_graph": {
            "add": {
                "process_id": "add",
                "arguments": {
                    "x": 3,
                    "y": 5
                },
                "result": true
            }
        }
    }
}'

172.24.237.254 - - [11/May/2023:19:54:58 +0000] "POST /openeo/1.0/result HTTP/1.1" 0 0 "-" "-" 291568 "openeo-dev-results@zookeeper" "-" 0ms
bossie@rastapopoulos:~$ curl --location 'http://traefik01/openeo/1.0/result' --header 'Content-Type: application/json' --header 'Host: openeo.vito.be' --header 'Authorization: Bearer basic//...' --data '{
    "process": {
        "process_graph": {
            "add": {
                "process_id": "add",
                "arguments": {
                    "x": 3,
                    "y": 5
                },
                "result": true
            }
        }
    }
}'
8

172.24.237.254 - - [11/May/2023:19:58:07 +0000] "POST /openeo/1.0/result HTTP/1.1" 200 2 "-" "-" 154685 "openeo-prod-credits-check@zookeeper" "http://192.168.207.61:56877" 1266ms
bossie@rastapopoulos:~$ curl --location 'http://traefik02/openeo/1.0/result' --header 'Content-Type: application/json' --header 'Host: openeo.vito.be' --header 'Authorization: Bearer basic//...' --data '{
    "process": {
        "process_graph": {
            "add": {
                "process_id": "add",
                "arguments": {
                    "x": 3,
                    "y": 5
                },
                "result": true
            }
        }
    }
}'
8

172.24.237.254 - - [11/May/2023:20:00:26 +0000] "POST /openeo/1.0/result HTTP/1.1" 200 2 "-" "-" 152870 "openeo-prod-credits-check@zookeeper" "http://192.168.207.61:56877" 288ms

The current situation is that:

The particular router that the traefik-dev nodes pick, tends to vary because in this case there are two rules that overlap with the same priority of 100: openeo-dev and openeo-dev-results; if it picks openeo-dev, no middlewares are invoked and the request returns 8, whereas if it picks openeo-dev-results, 4 middlewares are invoked, they fail for basic auth and the request returns an empty response.

The particular router that the traefik-prod nodes pick, is always the same because its rule has the highest priority of 194 (the rule's string length): openeo-prod-credits-check; this router will only invoke the credits_check_prod middleware, it succeeds for basic auth and the request returns 8.

So the difference in behaviour between dev and prod is explained by the fact that dev has overlapping rules with the same priority and the Traefik node will just pick one. This also explains why the dev responses might alternate between good and bad (round-robin between e.g. traefik01-dev that picks the openeo-dev rule and traefik02-dev that picks the openeo-dev-results rule) This also explains why targeting openeo-dev.vito.be had bad behaviour while targeting traefik01-dev directly had good behaviour and I incorrectly concluded that the problem was in a downstream proxy (by sheer luck traefik01 picked the openeo-dev rule)

tl;dr

Also: https://github.com/Open-EO/openeo-geopyspark-driver/issues/346

bossie commented 1 year ago

Confirmed: raising the priority of openeo-dev-credits-check from 99 to 1000 routes the requests to openeo-dev-credits-check and makes openeo-dev.vito.be return valid responses for basic auth.

172.24.236.7 - - [12/May/2023:04:52:30 +0000] "POST /openeo/1.1/result HTTP/1.1" 200 2 "-" "-" 6589 "openeo-dev-credits-check@zookeeper" "http://192.168.207.55:57975" 210ms

Since the credits_check plugin works, the problem is elsewhere.

bossie commented 1 year ago

Identified a bug in the add_identifier_header plugin wrt/ basic auth.

Confirmed (sudo journalctl -u traefik -f):

May 12 09:39:40 traefik01-dev.vgt.vito.be traefik[4840]: [05/Jun/2023:14:08:55] ERROR(add_identifier_header): invalid token format
bossie commented 1 year ago

Fixed add_identifier_header plugin by:

bossie commented 1 year ago

Applying the fix on dev shows that the openeo-dev-results route now works and returns a proper response. Batch jobs are fixed as well.

172.24.236.7 - - [12/May/2023:07:50:19 +0000] "POST /openeo/1.1/result HTTP/1.1" 200 2 "-" "-" 19 "openeo-dev-results@zookeeper" "http://192.168.207.55:57975" 184ms
bossie commented 1 year ago

TODO:

bossie commented 1 year ago

The credits_check plugin fails for basic auth but this error is swept under the rug:

May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(add_identifier_header): Authorization Header is: [Bearer basic//...]
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Logging result request at: etl.proba-v-mep.esa.int
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] INFO(credits_check): Log result request service HTTP response code: 500
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Credits check service url: etl.proba-v-mep.esa.int
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Credits check service HTTP response code: 500

Seems to work for for OIDC auth:

May 12 10:20:00 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(add_identifier_header): Authorization Header is: [Bearer oidc/egi/...]
May 12 10:20:00 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(add_identifier_header): Unique identifier is: 7d523381374b62e6f2aad1f2f9fb6eddf624d382a8f71db6dcf5788e3aae0af3@egi.eu
May 12 10:20:00 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Logging result request at: etl.proba-v-mep.esa.int
May 12 10:20:05 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] INFO(credits_check): Log result request service HTTP response code: 201
May 12 10:20:05 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Credits check service url: etl.proba-v-mep.esa.int
bossie commented 1 year ago

Znodes are configured in traefik_config/terraform/zk.tfvars; run https://jenkins/job/GDD/job/traefik_config/ to apply.

bossie commented 1 year ago

Applied everything on dev and prod.