Open-EO / openeo-geopyspark-driver

OpenEO driver for GeoPySpark (Geotrellis)
Apache License 2.0

job tracker failures due to vault login_kerberos failures #359

Open soxofaan opened 1 year ago

soxofaan commented 1 year ago

The minutely job tracker runs sometimes fail (somewhat in bursts) with this error:

Traceback (most recent call last):
  File ".../openeogeotrellis/vault.py", line 83, in login_kerberos
    vault_token = subprocess.check_output(cmd, text=True, stderr=PIPE)
  File "/usr/lib64/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['vault', 'login', '-address=https://vault.....be', 
  '-token-only', '-method=kerberos', 'username=openeo', 'service=vault-prod', 
  'realm=....BE', 'keytab_path=openeo.keytab', 'krb5conf_path=/etc/krb5.conf']' 
  returned non-zero exit status 2.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../openeogeotrellis/job_tracker.py", line 439, in main
    etl_api_access_token = None if ConfigParams().is_kube_deploy else get_etl_api_access_token(args.principal,
  File ".../openeogeotrellis/job_tracker.py", line 387, in get_etl_api_access_token
    vault_token = vault.login_kerberos(principal, keytab)
  File ".../openeogeotrellis/vault.py", line 86, in login_kerberos
    raise VaultLoginError(
openeogeotrellis.vault.VaultLoginError: Vault login (Kerberos) failed: Command '['vault', 'login',
  '-address=https://vault.....be', '-token-only', '-method=kerberos', 'username=openeo',
  'service=vault-prod', 'realm=....BE', 'keytab_path=openeo.keytab', 'krb5conf_path=/etc/krb5.conf']' 
  returned non-zero exit status 2.. stderr: "Error authenticating: couldn't log in: 
  [Root cause: Encoding_Error] Encoding_Error: AS Exchange Error: 
  failed to process the AS_REP < Encoding_Error: 
  failed to unmarshal KDC's reply: asn1: syntax error: sequence truncated"

soxofaan commented 1 year ago

This discussion seems relevant: https://github.com/jcmturner/gokrb5/issues/189

I can confirm this was an issue with one of our KDCs being unresponsive.

So this looks like the same kind of network/connection issue that we are also seeing with connections to Elasticsearch from the job trackers.

soxofaan commented 1 year ago

@tcassaert have you seen this in other applications that use Vault?

(cc @bossie you might also be interested in this thread)

soxofaan commented 1 year ago

Just saw another kind of failure from the same location:

openeogeotrellis.vault.VaultLoginError: Vault login (Kerberos) failed: 
Command '['vault', 'login', '-address=https://vault.....be', '-token-only', 
'-method=kerberos', 'username=openeo', 'service=vault-prod', 
'realm=...BE', 'keytab_path=openeo.keytab', 'krb5conf_path=/etc/krb5.conf']' 
returned non-zero exit status 2.. stderr: "
Error authenticating: couldn't initialize context: 
[Root cause: Networking_Error] Networking_Error: TGS Exchange Error: 
issue sending TGS_REQ to KDC: failed to communicate with KDC. 
Attempts made with TCP (error in getting a TCP connection to any of the KDCs) 
and then UDP (error sending to a KDC: error sneding to ipa02.....be:88: 
sending over UDP failed to 192.168.207.28:88: 
read udp 172.17.0.4:47658->192.168.207.28:88: 
i/o timeout; error sneding to ipa01.....be:88: 
sending over UDP failed to 192.168.207.29:88: 
read udp 172.17.0.4:49111->192.168.207.29:88: i/o timeout)"

This also indicates network/connectivity issues.
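
Since both failures look transient (an unresponsive KDC, UDP/TCP timeouts), one possible mitigation is to retry the login a few times before giving up. The following is only a minimal sketch under that assumption, not the actual `openeogeotrellis.vault` code: `LoginError`, `run_login` and `with_retry` are hypothetical names, `LoginError` stands in for `VaultLoginError`, and the attempt count and delays are arbitrary illustration values.

```python
import subprocess
import time
from typing import Callable, Sequence


class LoginError(Exception):
    """Hypothetical stand-in for openeogeotrellis.vault.VaultLoginError."""


def run_login(cmd: Sequence[str]) -> str:
    """Run a login command and return its stdout (the token), stripped.

    Mirrors the pattern visible in the traceback: a non-zero exit status
    is re-raised as a login error that includes the captured stderr.
    """
    try:
        return subprocess.check_output(
            cmd, text=True, stderr=subprocess.PIPE
        ).strip()
    except subprocess.CalledProcessError as e:
        raise LoginError(f"Login failed: {e}. stderr: {e.stderr!r}") from e


def with_retry(
    func: Callable[[], str],
    attempts: int = 3,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call func, retrying on LoginError with exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except LoginError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
    raise ValueError("attempts must be at least 1")
```

Usage would be along the lines of `with_retry(lambda: run_login(cmd))`, with `cmd` being the `vault login ...` argument list from the traceback. Whether retrying is appropriate at all for a job tracker that already runs every minute is a separate question.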

tcassaert commented 1 year ago

This might be caused by VPN connectivity issues. I will check if I can find anything.