dianagudu / motley_cue

A service for mapping OIDC identities to local identities, with local user management
https://motley-cue.readthedocs.io/
MIT License

Worker Stability #67

Open harrypuuter opened 6 months ago

harrypuuter commented 6 months ago

Hello,

I have an odd error that I am not sure how to debug without going deep into the code of the tool. We have a working setup of motley-cue, but I notice that user logins quite often run into a timeout. In the logs at debug level, I only observe that a worker (presumably the one assigned to handle this specific user login) is terminated without any additional information:

gunicorn error log

[2024-01-22 09:33:38 +0100] [1222899] [WARNING] Worker with pid 1223052 was terminated due to signal 6

I can retry the login a second time and it works as intended. Once the login has worked, it also keeps working consistently, which makes this issue a bit hard to debug. We are using the SQLite backend for the OTP mapping, and gunicorn is running with 20 workers. Below I have attached some more logs of the problem.

Is this a known issue, or does it point to a configuration error on our side?

Best wishes, Sebastian

use_otp = True
backend = sqlite
db_location = {{ motley_mapping_folder }}/tokenmap.db
keyfile = {{ motley_mapping_folder }}/motley_cue.key

mccli log from a failed login

mccli --debug --no-cache --oidc oidc_account ssh my_moyley_host.edu
info: Trying to get ssh hostname from arguments.
debug: Running this command to get ssh configuration: ssh -G my_moyley_host.edu
debug: Found hostname by parsing command output: my_moyley_host.edu
info: Got host 'my_moyley_host.edu', looking for motley_cue service on host.
info: Looking for motley_cue service at 'https://my_moyley_host.edu'...
info: ...FOUND IT!
info: No access token provided.
info: Using oidc-agent account: oidc_account
info: Requesting token from oidc-agent for account oidc_account with scope openid profile email eduperson_entitlement wlcg wlcg.groups and audience .
debug: Generated one-time password for use with SSH instead of long access token.
error: Expecting value: line 1 column 1 (char 0)

mccli log from successful login

mccli --debug --no-cache --oidc oidc_account ssh my_moyley_host.edu
info: Trying to get ssh hostname from arguments.
debug: Running this command to get ssh configuration: ssh -G my_moyley_host.edu
debug: Found hostname by parsing command output: my_moyley_host.edu
info: Got host 'my_moyley_host.edu', looking for motley_cue service on host.
info: Looking for motley_cue service at 'https://my_moyley_host.edu'...
info: ...FOUND IT!
info: No access token provided.
info: Using oidc-agent account: oidc_account
info: Requesting token from oidc-agent for account oidc_account with scope openid profile email eduperson_entitlement wlcg wlcg.groups and audience .
debug: Generated one-time password for use with SSH instead of long access token.
info: State of your local account: deployed
info: Updating local account...
debug: {
debug:   "state": "deployed",
debug:   "message": "User already existed.",
debug:   "credentials": {
debug:     "description": "Local SSH Test Service",
debug:     "login_help": "Login via `mccli ssh {ssh_host}`.",
debug:     "ssh_host": "my_moyley_host.edu",
debug:     "ssh_user": "sbrommer",
debug:     "commandline": "ssh sbrommer@my_moyley_host.edu"
debug:   }
debug: }
debug: Created OTP [2e822eadea7f8536e28271456109a3dd7f21f22524aa8d642ce118edefbed706faa52bc25ea60777f575b6dafa8efb8d7654040bb3cae49d1efb093eb3c41279] to be used as SSH password.
debug: is a tty

Motley Version

motley-cue.x86_64                         0.6.1-1.el9                          @repo-data-kit-edu
dianagudu commented 6 months ago

Hi there,

Thanks for reporting this. I'll see if I can reproduce this. Could you also post any relevant logs from motley-cue, e.g. in /var/log/motley_cue/mapper.log or /var/log/motley_cue/feudal.log?

harrypuuter commented 6 months ago

Hey,

here are all the logs for a failed login attempt. It seems that the user is checked and everything, but this information is not passed along correctly, resulting in a timeout:

feudal.log

[2024-01-22 12:16:27,185] { ...er/backend/hooks.py:11 }   DEBUG - Hook post_create set to /etc/motley_cue/post_account_creation.sh
[2024-01-22 12:16:27,185] { ...ter/backend/ldap.py:627}   DEBUG - backend processing: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 12:16:27,185] { ...ter/backend/ldap.py:648}    INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F

motley_cue.gunicorn.error

[2024-01-22 12:17:58 +0100] [1946054] [CRITICAL] WORKER TIMEOUT (pid:1946065)
[2024-01-22 12:17:58 +0100] [1946054] [WARNING] Worker with pid 1946065 was terminated due to signal 6
[2024-01-22 12:17:58 +0100] [1960525] [INFO] Booting worker with pid: 1960525
[2024-01-22 12:17:58 +0100] [1960525] [INFO] Started server process [1960525]
[2024-01-22 12:17:58 +0100] [1960525] [INFO] Waiting for application startup.
[2024-01-22 12:17:58 +0100] [1960525] [INFO] Application startup complete.

mapper.log (I removed the access token and the user info)

[2024-01-22 12:16:25] [flaat] DEBUG - Access token:*removed_accesstoken*
[2024-01-22 12:16:25] [flaat.issuers] DEBUG - Fetching issuer config from: https://iam-darwin.cloud.cnaf.infn.it/.well-known/openid-configuration
[2024-01-22 12:16:25] [flaat.issuers] INFO - Retrieved config for issuer: https://iam-darwin.cloud.cnaf.infn.it/
[2024-01-22 12:16:25] [flaat] DEBUG - Access token is a JWT
[2024-01-22 12:16:25] [flaat.issuers] DEBUG - Fetching issuer config from: https://iam-darwin.cloud.cnaf.infn.it/.well-known/openid-configuration
[2024-01-22 12:16:25] [flaat.issuers] INFO - Retrieved config for issuer: https://iam-darwin.cloud.cnaf.infn.it/
[2024-01-22 12:16:25] [flaat.issuers] DEBUG - Trying to get userinfo from https://iam-darwin.cloud.cnaf.infn.it/userinfo
[2024-01-22 12:16:25] [flaat.issuers] DEBUG - Got userinfo from https://iam-darwin.cloud.cnaf.infn.it/userinfo: {
    "email": "******",
    "email_verified": true,
    "family_name": "**********",
    "given_name": "****",
    "name": "********",
    "preferred_username": "sbrommer",
    "sub": "6c9d07ba-7567-422f-85b2-8dc12b8e3b3d",
    "updated_at": 1699522389,
    "wlcg.groups": [
        "/compute_admin",
        "/production",
        "/production/compute",
        "/production/storage",
        "/user",
        "/user/monitoring",
        "/user/storage"
    ]
}

[2024-01-22 12:16:25] [flaat.issuers] DEBUG - Skipping token introspection because both client_id and client_secret are not configured
[2024-01-22 12:16:25] [flaat.requirements] DEBUG - Required audience empty or not specified.
[2024-01-22 12:16:25] [motley_cue.mapper.token_manager] DEBUG - Storing OTP [c7e80ff6440cf3593df0779ab69fde658e4dc3d09795d013134e255395fb7e58adeab52c919aded1cfe465f501f85c6dd0902380c5cf6dc26788d5248fe5946b] for token [*removed_accesstoken*]
[2024-01-22 12:16:26] [flaat] DEBUG - Access token:*removed_accesstoken*
[2024-01-22 12:16:26] [flaat.issuers] DEBUG - Fetching issuer config from: https://iam-darwin.cloud.cnaf.infn.it/.well-known/openid-configuration
[2024-01-22 12:16:26] [flaat.issuers] INFO - Retrieved config for issuer: https://iam-darwin.cloud.cnaf.infn.it/
[2024-01-22 12:16:26] [flaat] DEBUG - Access token is a JWT
[2024-01-22 12:16:26] [flaat.issuers] DEBUG - Fetching issuer config from: https://iam-darwin.cloud.cnaf.infn.it/.well-known/openid-configuration
[2024-01-22 12:16:26] [flaat.issuers] INFO - Retrieved config for issuer: https://iam-darwin.cloud.cnaf.infn.it/
[2024-01-22 12:16:26] [flaat.issuers] DEBUG - Trying to get userinfo from https://iam-darwin.cloud.cnaf.infn.it/userinfo
[2024-01-22 12:16:27] [flaat.issuers] DEBUG - Got userinfo from https://iam-darwin.cloud.cnaf.infn.it/userinfo: {
    "email": "******",
    "email_verified": true,
    "family_name": "**********",
    "given_name": "****",
    "name": "********",
    "preferred_username": "sbrommer",
    "sub": "6c9d07ba-7567-422f-85b2-8dc12b8e3b3d",
    "updated_at": 1699522389,
    "wlcg.groups": [
        "/compute_admin",
        "/production",
        "/production/compute",
        "/production/storage",
        "/user",
        "/user/monitoring",
        "/user/storage"
    ]
}
[2024-01-22 12:16:27] [flaat.issuers] DEBUG - Skipping token introspection because both client_id and client_secret are not configured
[2024-01-22 12:16:27] [flaat.requirements] DEBUG - Required audience empty or not specified.
[2024-01-22 12:16:27] [ldf_adapter] DEBUG - Hook post_create set to /etc/motley_cue/post_account_creation.sh
[2024-01-22 12:16:27] [ldf_adapter.backend.ldap] DEBUG - backend processing: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 12:16:27] [ldf_adapter.backend.ldap] INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 12:17:58] [flaat.config] DEBUG - Setting flaat log level to: 10
[2024-01-22 12:17:58] [motley_cue.mapper.token_manager] DEBUG - Key already exists in /etc/motley_cue/mappings/motley_cue.key, nothing to do here.
dianagudu commented 6 months ago

Thanks! Do you also have an error in /var/log/nginx/error.log that looks like this?

[error] 31#31: *9 upstream prematurely closed connection while reading response header from upstream, client: <IP>, server: _, request: "GET /user/deploy HTTP/1.1", upstream: "http://unix:/run/motley_cue/motley-cue.sock:/user/deploy", host: "localhost:8080"

It could be an error on the mccli side related to timeouts in this case.

harrypuuter commented 6 months ago

In the nginx logs I indeed see something like this:

10.0.0.2 - - [22/Jan/2024:11:16:22 +0000] "GET / HTTP/1.1" 200 687 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
10.0.0.2 - - [22/Jan/2024:11:16:23 +0000] "GET /info/op?url=https://iam-darwin.cloud.cnaf.infn.it/ HTTP/1.1" 200 98 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
10.0.0.2 - - [22/Jan/2024:11:16:25 +0000] "GET /user/generate_otp HTTP/1.1" 200 36 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
2024/01/22 11:17:26 [error] 27#27: *16 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.0.2, server: portal.darwin.kit.edu, request: "GET /user/get_status HTTP/1.1", upstream: "http://unix://run/motley_cue/motley-cue.sock/user/get_status", host: "portal.darwin.kit.edu"
10.0.0.2 - - [22/Jan/2024:11:17:26 +0000] "GET /user/get_status HTTP/1.1" 504 167 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
dianagudu commented 6 months ago

OK. I would suggest first increasing the various timeout settings to see whether the problem persists.

For nginx, you could modify /etc/nginx/sites-available/nginx.motley_cue and add any (or all) of the following:

proxy_connect_timeout 100; # 100s just as an example...
proxy_read_timeout 100;
proxy_send_timeout 100;
keepalive_timeout 75;

Restart nginx afterwards.

Similarly, for gunicorn you can modify /etc/motley_cue/motley_cue.env and then restart motley-cue.

TIMEOUT=120
KEEP_ALIVE=10

It would also help to know where exactly the /get_status request is too slow or blocks, to try and optimise that. Did the logs in mapper.log end at the point you pasted here:

....
[2024-01-22 12:17:58] [motley_cue.mapper.token_manager] DEBUG - Key already exists in /etc/motley_cue/mappings/motley_cue.key, nothing to do here.
harrypuuter commented 6 months ago

I just ran a test again with the increased timeouts and am still getting the same error. These are the logs; essentially, nothing is logged for the duration of the timeout, only at the very beginning and after the timeout has passed. In mapper.log there is nothing after this Key already exists message. See the logs below:

feudal.log

[2024-01-22 17:26:40,247] { ...er/backend/hooks.py:11 }   DEBUG - Hook post_create set to /etc/motley_cue/post_account_creation.sh
[2024-01-22 17:26:40,247] { ...ter/backend/ldap.py:627}   DEBUG - backend processing: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:26:40,247] { ...ter/backend/ldap.py:648}    INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:28:27,532] { ...adapter/logsetup.py:102}   DEBUG - Running: ----------------------------------------------------------------------------------------------------
[2024-01-22 17:28:27,532] { ...adapter/logsetup.py:105}   DEBUG -          /usr/lib/motley-cue/bin/gunicorn motley_cue.api:api -k uvicorn.workers.UvicornWorker --config /usr/lib/motley-cue/etc/gunicorn/gunicorn.conf.py
[2024-01-22 17:28:27,532] { ...ter/backend/ldap.py:607}   DEBUG - ................... Initialising LDAP connection...
[2024-01-22 17:28:27,560] { ...ter/backend/ldap.py:183}    INFO - uidNext already initialised: 10002.
[2024-01-22 17:28:27,562] { ...ter/backend/ldap.py:196}    INFO - gidNext already initialised: 10009.

gunicorn

[2024-01-22 16:32:18 +0100] [1995545] [INFO] Application startup complete.
[2024-01-22 17:28:27 +0100] [1995391] [CRITICAL] WORKER TIMEOUT (pid:1995542)
[2024-01-22 17:28:27 +0100] [1995391] [WARNING] Worker with pid 1995542 was terminated due to signal 6
[2024-01-22 17:28:27 +0100] [2001454] [INFO] Booting worker with pid: 2001454
[2024-01-22 17:28:27 +0100] [2001454] [INFO] Started server process [2001454]
[2024-01-22 17:28:27 +0100] [2001454] [INFO] Waiting for application startup.
[2024-01-22 17:28:27 +0100] [2001454] [INFO] Application startup complete.
[2024-01-22 16:32:42] [motley_cue.mapper.local_user_management] INFO - Reached state 'deployed': username sbrommer
[2024-01-22 17:26:39] [flaat] DEBUG - Access token: token
[2024-01-22 17:26:39] [flaat.issuers] DEBUG - Fetching issuer config from: https://iam-darwin.cloud.cnaf.infn.it/.well-known/openid-configuration
[2024-01-22 17:26:39] [flaat.issuers] INFO - Retrieved config for issuer: https://iam-darwin.cloud.cnaf.infn.it/
[2024-01-22 17:26:39] [flaat] DEBUG - Access token is a JWT
[2024-01-22 17:26:39] [flaat.issuers] DEBUG - Trying to get userinfo from https://iam-darwin.cloud.cnaf.infn.it/userinfo
[2024-01-22 17:26:40] [flaat.issuers] DEBUG - Got userinfo from https://iam-darwin.cloud.cnaf.infn.it/userinfo: {
    "email": "**",
    "email_verified": true,
    "family_name": "**",
    "given_name": "**",
    "name": "**",
    "preferred_username": "sbrommer",
    "sub": "6c9d07ba-7567-422f-85b2-8dc12b8e3b3d",
    "updated_at": 1699522389,
    "wlcg.groups": [
        "/compute_admin",
        "/production",
        "/production/compute",
        "/production/storage",
        "/user",
        "/user/monitoring",
        "/user/storage"
    ]
}
[2024-01-22 17:26:40] [flaat.issuers] DEBUG - Skipping token introspection because both client_id and client_secret are not configured
[2024-01-22 17:26:40] [flaat.requirements] DEBUG - Required audience empty or not specified.
[2024-01-22 17:26:40] [motley_cue.mapper.token_manager] DEBUG - Storing OTP [756f1b9fbd33946a9ffe0829bae484ca92b0d808f54a04d0a93d0ae4b615f02efb2171d6035da8faa80bf67b211e0317648e9d0f4a7903af6d49a39fa14295c6] for token [token]
[2024-01-22 17:26:40] [flaat.requirements] DEBUG - Required audience empty or not specified.
[2024-01-22 17:26:40] [ldf_adapter] DEBUG - Hook post_create set to /etc/motley_cue/post_account_creation.sh
[2024-01-22 17:26:40] [ldf_adapter.backend.ldap] DEBUG - backend processing: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:26:40] [ldf_adapter.backend.ldap] INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:28:27] [flaat.config] DEBUG - Setting flaat log level to: 10
[2024-01-22 17:28:27] [motley_cue.mapper.token_manager] DEBUG - Key already exists in /etc/motley_cue/mappings/motley_cue.key, nothing to do here.

nginx

10.0.0.2 - - [22/Jan/2024:15:32:42 +0000] "GET /verify_user?username=sbrommer HTTP/1.1" 200 36 "-" "-" "2a00:1398:5:fe04::810d:8c25"
10.0.0.2 - - [22/Jan/2024:16:26:38 +0000] "GET / HTTP/1.1" 200 687 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
10.0.0.2 - - [22/Jan/2024:16:26:38 +0000] "GET /info/op?url=https://iam-darwin.cloud.cnaf.infn.it/ HTTP/1.1" 200 98 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
10.0.0.2 - - [22/Jan/2024:16:26:40 +0000] "GET /user/generate_otp HTTP/1.1" 200 36 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
2024/01/22 16:28:20 [error] 23#23: *8 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.0.2, server: portal.darwin.kit.edu, request: "GET /user/get_status HTTP/1.1", upstream: "http://unix://run/motley_cue/motley-cue.sock/user/get_status", host: "portal.darwin.kit.edu"
10.0.0.2 - - [22/Jan/2024:16:28:20 +0000] "GET /user/get_status HTTP/1.1" 504 167 "-" "python-requests/2.31.0" "2a02:8071:6410:480:1d80:e93a:8c08:b7e2"
harrypuuter commented 6 months ago

I just checked: in a successful attempt, the LDAP backend returns a statement that the user exists, and then the account is allowed to log in:

[2024-01-22 17:45:39] [ldf_adapter.backend.ldap] INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:45:39] [ldf_adapter.backend.ldap] DEBUG - This user does actually exist. The name is: sbrommer and the primary group is: darwin
[2024-01-22 17:45:39] [ldf_adapter.userinfo] INFO - group method: classic
[2024-01-22 17:45:39] [ldf_adapter.userinfo] DEBUG - Using plain groups from 'groups' claim
[2024-01-22 17:45:39] [ldf_adapter.userinfo] INFO - groups: set()
[2024-01-22 17:45:39] [ldf_adapter.userinfo] DEBUG - got grouplist: []
[2024-01-22 17:45:39] [ldf_adapter.backend.ldap] INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:45:39] [ldf_adapter] INFO - Incoming request to reach 'get_status' for user with email: 'None' (6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F) username: sbrommer
[2024-01-22 17:45:39] [ldf_adapter.backend.ldap] INFO - Check if user exists: 6c9d07ba-7567-422f-85b2-8dc12b8e3b3d@https%3A%2F%2Fiam-darwin.cloud.cnaf.infn.it%2F
[2024-01-22 17:45:39] [motley_cue.mapper.local_user_management] INFO - Reached state 'deployed': username sbrommer

Could this be the cause of the timeout?

dianagudu commented 6 months ago

Indeed, I think in the first case the LDAP search operation times out, and the second time it succeeds, probably due to caching. I'll need to dig into the feudal-adapter code and see how the performance of LDAP queries can be optimised.

dianagudu commented 6 months ago

Do you have any logs from your LDAP server, e.g. which operation times out, as well as some information on how the LDAP server is configured?

harrypuuter commented 6 months ago

Hi,

I do not have access to the LDAP server; it is run by KIT, one of the fels LDAP servers. From the sssd logs, I cannot see anything out of the ordinary, but I guess the LDAP search operations are done directly via Python, not via sssd.

Concerning the configuration on the client side, see below for the client configurations for sssd and feudal:

[domain/ldap]
id_provider = ldap
autofs_provider = ldap
auth_provider = ldap
chpass_provider = ldap
ldap_uri = ldap://fels.scc.kit.edu:389
ldap_search_base = ou=darwin,dc=fels,dc=de
ldap_default_bind_dn = --removed--
ldap_default_authtok = --removed--
cache_credentials = False
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_tls_reqcert = demand
ldap_id_use_start_tls = True
ldap_user_object_class = posixAccount
ldap_user_name = uid
ldap_user_uid_number = uidNumber
ldap_user_gid_number = gidNumber
ldap_user_search_base = ou=users,ou=darwin,dc=fels,dc=de
ldap_group_search_base = ou=groups,ou=darwin,dc=fels,dc=de
[backend.ldap]

mode = full_access
host = fels.scc.kit.edu
port = 389
admin_user = --removed--
admin_password = --removed--
tls = false
user_base = ou=users,ou=darwin,dc=fels,dc=de
attribute_oidc_uid = gecos
attribute_local_uid = uid
group_base = ou=groups,ou=darwin,dc=fels,dc=de
shell = /bin/bash
home_base = /home
uid_min = 10001
gid_min = 10001
post_create_script = /etc/motley_cue/post_account_creation.sh
harrypuuter commented 5 months ago

I had a closer look into this issue today and found the cause as well as a solution. The problem is that the LDAP connections are opened at the startup of motley-cue (or the feudal part of it) [1]. Since the LDAP instance we are using is behind a firewall and might also be restarted at some point, there is a high chance that the connection opened at worker startup is no longer available. To avoid this effect, I would propose adding an additional parameter (something like ldap_connection_timeout) that can be used to periodically re-establish the connection. I tested this with a timeout of 10 minutes, and over the last day I did not have any login failures anymore. For setups where the LDAP connection is not closed by firewalls, server restarts etc., this timeout could be disabled (but maybe even there a long timeout would make sense for overall stability).

Maybe there is also a way within the LDAP package to check whether the connection is still working; that could be an even nicer solution to this problem.

[1] https://codebase.helmholtz.cloud/m-team/feudal/feudaladapterldf/-/blob/master/ldf_adapter/backend/ldap.py?ref_type=heads#L152-157

dianagudu commented 5 months ago

Thanks for tracking down the issue! Looking at the ldap3 docs, using SAFE_RESTARTABLE when initiating the connection might fix this.

[1] https://ldap3.readthedocs.io/en/latest/connection.html
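
For illustration, a minimal sketch of what that could look like in ldap3 (untested; host and bind credentials are placeholders taken from the [backend.ldap] config above, not the actual feudal-adapter code):

from ldap3 import ALL, Connection, SAFE_RESTARTABLE, Server

# Placeholder connection details -- substitute the values from [backend.ldap]
server = Server("fels.scc.kit.edu", port=389, get_info=ALL)
connection = Connection(
    server,
    user="--removed--",                 # admin_user
    password="--removed--",             # admin_password
    client_strategy=SAFE_RESTARTABLE,   # transparently reopens a dropped connection
    auto_bind=True,
)

# With the SAFE_* strategies, every operation returns a (status, result, response, request) tuple
status, result, response, _ = connection.search(
    "ou=users,ou=darwin,dc=fels,dc=de",
    "(objectClass=posixAccount)",
    attributes=["uid"],
)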

harrypuuter commented 5 months ago

I had also tried this, as well as using a connection pool with multiple connections; the relevant pool options from the ldap3 docs are:

pool_name: an optional identifier for the Connection pool when using a pooled connection strategy
pool_size: size of the connection pool used in a pooled connection strategy
pool_lifetime: number of seconds before recreating a new connection in a pooled connection strategy
pool_keepalive: number of seconds to wait before sending an Abandon(0) operation in an idle connection in a pooled connection strategy. Abandon(0) is an harmless LDAP operation used to not let the server closing the connection

However, neither approach resulted in a successful fix. I think the problem is that the client does not get any information that the connection was closed, and thus it is not restarted or handled in any way.
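
For reference, a pooled connection in ldap3 is set up roughly as follows (REUSABLE is the pooled client strategy; all values here are illustrative only):

from ldap3 import Connection, REUSABLE, Server

server = Server("fels.scc.kit.edu", port=389)
connection = Connection(
    server,
    user="--removed--",
    password="--removed--",
    client_strategy=REUSABLE,       # pooled connection strategy
    pool_name="motley_cue_pool",
    pool_size=5,                    # parallel connections in the pool
    pool_lifetime=600,              # recreate each connection after 10 minutes
    pool_keepalive=30,              # send Abandon(0) on idle connections
    auto_bind=True,
)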

harrypuuter commented 5 months ago

For documentation, this is the modification that we have now been using for nearly two weeks without any problems (with some additional log information):

index e851bbe..791a2d1 100644
--- a/ldf_adapter/backend/ldap.py
+++ b/ldf_adapter/backend/ldap.py
@@ -7,6 +7,8 @@ It"s in the proof-of-concept state.
 # pylint: disable=raise-missing-from, missing-docstring, too-few-public-methods

 import logging
+import time
+import threading
 from ldap3 import (
     AUTO_BIND_NO_TLS,
     SAFE_RESTARTABLE,
@@ -16,6 +18,7 @@ from ldap3 import (
     MODIFY_REPLACE,
     MODIFY_DELETE,
     MODIFY_ADD,
+    SAFE_RESTARTABLE,
     SAFE_SYNC,
     LDIF,
 )
@@ -134,7 +137,13 @@ class LdapConnection:
         self.uid_max = CONFIG.backend.ldap.uid_max
         self.gid_min = CONFIG.backend.ldap.gid_min
         self.gid_max = CONFIG.backend.ldap.gid_max
+        self.refresh_interval = 300
+        self.stop_refresh = threading.Event()
+        threading.Thread(target=self.refresh_connection, daemon=True).start()
+        self.connect()
+

+    def connect(self):
         # initialise connection used to generate LDIFs
         self.ldif_connection = Connection(server=None, client_strategy=LDIF)
         self.ldif_connection.bind()
@@ -165,6 +174,17 @@ class LdapConnection:
             logger.error(f"{msg}: {e}")
             raise Failure(message=msg)

+
+
+    def refresh_connection(self):
+        while not self.stop_refresh.is_set():
+            try:
+                logger.info("Refreshing LDAP connection...")
+                self.connect()
+            except Exception as e:
+                logger.error(f"Error refreshing LDAP connection: {e}")
+            time.sleep(self.refresh_interval)
+
     def init_nextuidgid(self):
         """Initialise uidNext and gidNext entries in FULL_ACCESS mode
         with values starting in configured range.
@@ -204,7 +224,17 @@ class LdapConnection:
             raise Failure(message=msg)

     def search_user_by_oidc_uid(self, oidc_uid, attributes=[]):
-        return LdapSearchResult(
+        logger.debug(f"Searching user by oidc_uid: {oidc_uid}")
+        query = [
+            f"{self.user_base}",
+            f"(&({self.attr_oidc_uid}={oidc_uid})(objectClass=inetOrgPerson)(objectClass=posixAccount))",
+        ]
+        logger.debug(f"Query: {query}")
+        logger.debug(f"Attributes: {attributes}")
+        logger.debug(f"self.user_base: {self.user_base}")
+        logger.debug(f"Starting timer...")
+        timer = time.time()
+        result = LdapSearchResult(
             self.connection,
             [
                 f"{self.user_base}",
@@ -212,6 +242,8 @@ class LdapConnection:
             ],
             {"attributes": attributes},
         )
+        logger.debug(f"Search took {time.time() - timer} seconds.")
+        return result

     def search_user_by_local_username(self, username, get_unique_id=False):
         return LdapSearchResult(
@@ -646,7 +678,9 @@ class User(generic.User):
         If this returns True,  calling `create` should have no effect or raise an error.
         """
         logger.info(f"Check if user exists: {self.unique_id}")
-        return LDAP.search_user_by_oidc_uid(self.unique_id, attributes=[]).found()
+        result = LDAP.search_user_by_oidc_uid(self.unique_id, attributes=[]).found()
+        logger.info(f"User {self.unique_id} exists: {result}")
+        return result

     def name_taken(self, name):
         """Return whether the username is already taken by another user on the service,
dianagudu commented 4 months ago

Thanks! It's great that you found a solution that works for you.

I tested it and it doesn't seem to have any issues. I did implement another solution that makes sure the connection is open before every operation, and it seems to work well in my tests. You can try it in the prerelease repository (latest version: 0.7.0~pr173-1). It would be great to get some feedback on whether it works in your setup as well!
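
Roughly, the idea of that solution is the following (a simplified sketch, not the exact code in 0.7.0~pr173-1):

class LdapConnection:
    ...

    def ensure_connection(self):
        # Hypothetical helper: rebind if the connection was closed
        # by a firewall or an LDAP server restart.
        if self.connection is None or self.connection.closed or not self.connection.bound:
            self.connect()

    def search_user_by_oidc_uid(self, oidc_uid, attributes=[]):
        self.ensure_connection()
        # ... perform the search as before ...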