Closed bossie closed 1 year ago
D. De Munck reported this issue as well:
Deze unit test werkt meestal, maar vrijdagavond bv. een hele tijd consistent niet en nu af en toe niet . https://.../projects/GEOM/repos/gmuav-pc/browse/uav-pc-scripts/terrain_follow/unit_test.py?at=refs%2Fheads%2Fdevelopment
It looks like the behavior is nicely alternating between success and failure:
import openeo
con = openeo.connect(
url="openeo-dev.vito.be"
# url="http://openeo-dev.vgt.vito.be"
# url="openeo.vito.be"
)
con.authenticate_basic(...)
pg = {"add": {"process_id": "add", "arguments": {"x": 3, "y": 5}, "result": True}}
for i in range(20):
try:
res = con.execute(pg)
print(i, f"great success: {res}")
except Exception as e:
print(i, f"fail: {e}")
->
0 great success: 8
1 fail: Expecting value: line 1 column 1 (char 0)
2 great success: 8
3 fail: Expecting value: line 1 column 1 (char 0)
4 great success: 8
5 fail: Expecting value: line 1 column 1 (char 0)
6 great success: 8
7 fail: Expecting value: line 1 column 1 (char 0)
8 great success: 8
9 fail: Expecting value: line 1 column 1 (char 0)
10 great success: 8
11 fail: Expecting value: line 1 column 1 (char 0)
12 great success: 8
13 fail: Expecting value: line 1 column 1 (char 0)
14 great success: 8
15 fail: Expecting value: line 1 column 1 (char 0)
16 great success: 8
17 fail: Expecting value: line 1 column 1 (char 0)
18 great success: 8
19 fail: Expecting value: line 1 column 1 (char 0)
looks like a bad entry in some round-robin load balancer
Probably related: creating jobs on dev works fine but starting them returns a 200 OK (without actually doing anything) instead of an expected 202 Accepted; the OpenEO Python client raises an error because of this unexpected response code. Again, there is a problem with basic auth as OIDC auth works fine.
Run e.g. tests.test_integration.test_batch_job_execute_batch
against https://openeo-dev.vito.be to reproduce.
The current theory is that a Traefik plugin fails to contact an API and returns a 200 OK without actually forwarding the request to OpenEO and returning its response.
Traefik dashboards (page /OpenEO in Confluence): ssh -L 8080:localhost:8080 traefik01...
, then visit http://localhost:8080/dashboard/
Below are /result requests with basic auth towards individual Traefik nodes, their responses and the corresponding entries in their access logs, including the router Traefik picked to service the request.
bossie@rastapopoulos:~$ curl --location 'http://traefik01-dev/openeo/1.0/result' --header 'Content-Type: application/json' --header 'Host: openeo-dev.vito.be' --header 'Authorization: Bearer basic//...' --data '{
"process": {
"process_graph": {
"add": {
"process_id": "add",
"arguments": {
"x": 3,
"y": 5
},
"result": true
}
}
}
}'
172.24.237.254 - - [11/May/2023:19:53:33 +0000] "POST /openeo/1.0/result HTTP/1.1" 0 0 "-" "-" 3888 "openeo-dev-results@zookeeper" "-" 0ms
bossie@rastapopoulos:~$ curl --location 'http://traefik02-dev/openeo/1.0/result' \
--header 'Content-Type: application/json' \
--header 'Host: openeo-dev.vito.be' \
--header 'Authorization: Bearer basic//...' \
--data '{
"process": {
"process_graph": {
"add": {
"process_id": "add",
"arguments": {
"x": 3,
"y": 5
},
"result": true
}
}
}
}'
172.24.237.254 - - [11/May/2023:19:54:58 +0000] "POST /openeo/1.0/result HTTP/1.1" 0 0 "-" "-" 291568 "openeo-dev-results@zookeeper" "-" 0ms
bossie@rastapopoulos:~$ curl --location 'http://traefik01/openeo/1.0/result' --header 'Content-Type: application/json' --header 'Host: openeo.vito.be' --header 'Authorization: Bearer basic//...' --data '{
"process": {
"process_graph": {
"add": {
"process_id": "add",
"arguments": {
"x": 3,
"y": 5
},
"result": true
}
}
}
}'
8
172.24.237.254 - - [11/May/2023:19:58:07 +0000] "POST /openeo/1.0/result HTTP/1.1" 200 2 "-" "-" 154685 "openeo-prod-credits-check@zookeeper" "http://192.168.207.61:56877" 1266ms
bossie@rastapopoulos:~$ curl --location 'http://traefik02/openeo/1.0/result' --header 'Content-Type: application/json' --header 'Host: openeo.vito.be' --header 'Authorization: Bearer basic//...' --data '{
"process": {
"process_graph": {
"add": {
"process_id": "add",
"arguments": {
"x": 3,
"y": 5
},
"result": true
}
}
}
}'
8
172.24.237.254 - - [11/May/2023:20:00:26 +0000] "POST /openeo/1.0/result HTTP/1.1" 200 2 "-" "-" 152870 "openeo-prod-credits-check@zookeeper" "http://192.168.207.61:56877" 288ms
The current situation is that:
openeo-dev-results
router and return 200 OK responses that are empty (bad);openeo-prod-credits-check
router and return 200 OK responses that contain the value 8 (good).The particular router that the traefik-dev nodes pick, tends to vary because in this case there are two rules that overlap with the same priority of 100: openeo-dev
and openeo-dev-results
; if it picks openeo-dev
, no middlewares are invoked and the request returns 8, whereas if it picks openeo-dev-results
, 4 middlewares are invoked, they fail for basic auth and the request returns an empty response.
The particular router that the traefik-prod nodes pick, is always the same because its rule has the highest priority of 194 (the rule's string length): openeo-prod-credits-check
; this router will only invoke the credits_check_prod
middleware, it succeeds for basic auth and the request returns 8.
So the difference in behaviour between dev and prod is explained by the fact that dev has overlapping rules with the same priority and the Traefik node will just pick one.
This also explains why the dev responses might alternate between good and bad (round-robin between e.g. traefik01-dev that picks the openeo-dev
rule and traefik02-dev that picks the openeo-dev-results
rule)
This also explains why targeting openeo-dev.vito.be had bad behaviour while targeting traefik01-dev directly had good behaviour and I incorrectly concluded that the problem was in a downstream proxy (by sheer luck traefik01 picked the openeo-dev
rule)
tl;dr
Also: https://github.com/Open-EO/openeo-geopyspark-driver/issues/346
Confirmed: raising the priority of openeo-dev-credits-check
from 99 to 1000 routes the requests to openeo-dev-credits-check
and makes openeo-dev.vito.be return valid responses for basic auth.
172.24.236.7 - - [12/May/2023:04:52:30 +0000] "POST /openeo/1.1/result HTTP/1.1" 200 2 "-" "-" 6589 "openeo-dev-credits-check@zookeeper" "http://192.168.207.55:57975" 210ms
Since the credits_check
plugin works, the problem is elsewhere.
Identified a bug in the add_identifier_header
plugin wrt/ basic auth.
Confirmed (sudo journalctl -u traefik -f
):
May 12 09:39:40 traefik01-dev.vgt.vito.be traefik[4840]: [05/Jun/2023:14:08:55] ERROR(add_identifier_header): invalid token format
Fixed add_identifier_header
plugin by:
basic.
bearer token).Applying the fix on dev shows that the openeo-dev-results
route now works and returns a proper response. Batch jobs are fixed as well.
172.24.236.7 - - [12/May/2023:07:50:19 +0000] "POST /openeo/1.1/result HTTP/1.1" 200 2 "-" "-" 19 "openeo-dev-results@zookeeper" "http://192.168.207.55:57975" 184ms
TODO:
openeo-dev-credits-check
in favor of openeo-dev-results
openeo-prod-credits-check
in favor of openeo-prod-results
credits_check plugin
for basic auth (see below)The credits_check
plugin fails for basic auth but this error is swept under the rug:
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(add_identifier_header): Authorization Header is: [Bearer basic//...]
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Logging result request at: etl.proba-v-mep.esa.int
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] INFO(credits_check): Log result request service HTTP response code: 500
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Credits check service url: etl.proba-v-mep.esa.int
May 12 10:18:30 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Credits check service HTTP response code: 500
Seems to work for for OIDC auth:
May 12 10:20:00 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(add_identifier_header): Authorization Header is: [Bearer oidc/egi/...]
May 12 10:20:00 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(add_identifier_header): Unique identifier is: 7d523381374b62e6f2aad1f2f9fb6eddf624d382a8f71db6dcf5788e3aae0af3@egi.eu
May 12 10:20:00 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Logging result request at: etl.proba-v-mep.esa.int
May 12 10:20:05 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] INFO(credits_check): Log result request service HTTP response code: 201
May 12 10:20:05 traefik02-dev.vgt.vito.be traefik[7401]: [05/Jun/2023:09:47:21] DEBUG(credits_check): Credits check service url: etl.proba-v-mep.esa.int
Znodes are configured in traefik_config/terraform/zk.tfvars; run https://jenkins/job/GDD/job/traefik_config/ to apply.
Applied everything on dev and prod.
This request towards dev will return a 200 OK response with an empty body about half the time; this only happens if you use basic auth, with OIDC auth it behaves normally.