HEPCloud / decisionengine_modules

Apache License 2.0
2 stars 19 forks source link

Decision engine 1.7.3 bug too many open file descriptors in glide_frontend_element.py #428

Closed StevenCTimm closed 2 years ago

StevenCTimm commented 2 years ago

The resource_request channel is repeatedly dying with the following exception:

022-07-28 13:55:34,261 - channel - glide_frontend_element - WARNING - {"channel": "", "class_module": "glide_frontend_element", "event": "failed to create /var/lib/gwms-frontend/tokens.d/NERSC-Perlmutter-GPU.idtoken: Error running 'condor_config_val TRUST_DOMAIN'\nStdout:\nStderr:\nException OSError:[Errno 24] Too many open files", "level": "warning", "logger": "channel", "timestamp": "2022-07-28 13:55:34"} 2022-07-28 13:55:34,262 - channel - glide_frontend_element - WARNING - {"channel": "", "class_module": "glide_frontend_element", "event": "Error running 'condor_config_val TRUST_DOMAIN'\nStdout:\nStderr:\nException OSError:[Errno 24] Too many open files", "level": "warning", "logger": "channel", "timestamp": "2022-07-28 13:55:34"} 2022-07-28 13:55:34,969 - channel - glidein_requests - ERROR - {"channel": "resource_request", "class_module": "glidein_requests", "class_name": "GlideinRequestManifests", "event": "Error generating glidein requests", "exception": "Traceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/subprocessSupport.py\", line 93, in iexe_cmd\n universal_newlines=text,\n File \"/usr/lib64/python3.6/subprocess.py\", line 729, in init\n restore_signals, start_new_session)\n File \"/usr/lib64/python3.6/subprocess.py\", line 1254, in _execute_child\n errpipe_read, errpipe_write = os.pipe()\nOSError: [Errno 24] Too many open files\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/condorExe.py\", line 159, in iexe_cmd\n stdout_data = subprocessSupport.iexe_cmd(cmd, stdin_data=stdin_data, child_env=child_env)\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/subprocessSupport.py\", line 110, in iexe_cmd\n raise RuntimeError(err_str % (cmd, stdoutdata, stderrdata, e))\nRuntimeError: Error running 'openssl x509 -in /etc/gwms-frontend/proxies/cms_proxy -noout -enddate'\nStdout:\nStderr:\nException OSError:[Errno 24] Too many open files\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/transforms/glidein_requests.py\", line 138, in transform\n job_filter=job_query, fom_entries=fom_entries)\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/glide_frontend_element.py\", line 1399, in generate_glidein_requests\n factory_globals, job_filter=job_filter)\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/glide_frontend_element.py\", line 1564, in generate_glidein_requests_one\n self.create_glideclientglobal_classads(row, key_builder))\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/glide_frontend_element.py\", line 658, in create_glideclientglobal_classads\n cred.renew()\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/security.py\", line 203, in renew\n remaining = self.time_left()\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/security.py\", line 189, in time_left\n \"openssl x509 -in %s -noout -enddate\" % self.filename)\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/condorExe.py\", line 167, in iexe_cmd\n raise ExeError(msg)\nglideinwms.lib.condorExe.ExeError: Unexpected Error running 'openssl x509 -in /etc/gwms-frontend/proxies/cms_proxy -noout -enddate'. Details: Error running 'openssl x509 -in /etc/gwms-frontend/proxies/cms_proxy -noout -enddate'\nStdout:\nStderr:\nException OSError:[Errno 24] Too many open files. Stdout: ", "level": "error", "logger": "channel", "timestamp": "2022-07-28 13:55:34"} 2022-07-28 13:55:35,071 - channel - TaskManager - ERROR - {"channel": "resource_request", "event": "exception from transform GlideinRequestManifests ", "exception": "Traceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/subprocessSupport.py\", line 93, in iexe_cmd\n universal_newlines=text,\n File \"/usr/lib64/python3.6/subprocess.py\", line 729, in init\n restore_signals, start_new_session)\n File \"/usr/lib64/python3.6/subprocess.py\", line 1254, in _execute_child\n errpipe_read, errpipe_write = os.pipe()\nOSError: [Errno 24] Too many open files\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/condorExe.py\", line 159, in iexe_cmd\n stdout_data = subprocessSupport.iexe_cmd(cmd, stdin_data=stdin_data, child_env=child_env)\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/subprocessSupport.py\", line 110, in iexe_cmd\n raise RuntimeError(err_str % (cmd, stdoutdata, stderrdata, e))\nRuntimeError: Error running 'openssl x509 -in /etc/gwms-frontend/proxies/cms_proxy -noout -enddate'\nStdout:\nStderr:\nException OSError:[Errno 24] Too many open files\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py\", line 457, in run_transform\n data = transform.worker.transform(data_block)\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/transforms/glidein_requests.py\", line 138, in transform\n job_filter=job_query, fom_entries=fom_entries)\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/glide_frontend_element.py\", line 1399, in generate_glidein_requests\n factory_globals, job_filter=job_filter)\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/glide_frontend_element.py\", line 1564, in generate_glidein_requests_one\n self.create_glideclientglobal_classads(row, key_builder))\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/glide_frontend_element.py\", line 658, in create_glideclientglobal_classads\n cred.renew()\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/security.py\", line 203, in renew\n remaining = self.time_left()\n File \"/usr/lib/python3.6/site-packages/decisionengine_modules/glideinwms/security.py\", line 189, in time_left\n \"openssl x509 -in %s -noout -enddate\" % self.filename)\n File \"/usr/lib/python3.6/site-packages/glideinwms/lib/condorExe.py\", line 167, in iexe_cmd\n raise ExeError(msg)\nglideinwms.lib.condorExe.ExeError: Unexpected Error running 'openssl x509 -in /etc/gwms-frontend/proxies/cms_proxy -noout -enddate'. Details: Error running 'openssl x509 -in /etc/gwms-frontend/proxies/cms_proxy -noout -enddate'\nStdout:\nStderr:\nException OSError:[Errno 24] Too many open files. Stdout: ", "level": "error", "logger": "channel", "module": "TaskManager", "timestamp": "2022-07-28 13:55:34"}

I have already modified the systemd unit file

Jul 26 17:09 decisionengine.service [root@fermicloud435 system]# cat decisionengine.service [Unit] Description = Decision EngineService Wants = network-online.target After = network.target

[Service] Type = simple LimitNOFILE = 16384 EnvironmentFile = -/etc/sysconfig/decisionengine User = decisionengine Group = decisionengine ExecStart = /usr/bin/decisionengine ExecStop = /usr/bin/echo "Running /usr/bin/de-client --stop" >&2 ExecStop = /usr/bin/de-client --stop

[Install]

To increase the number of file descriptors for user decisionengine.

This helped it run for about a day rather than just a couple of hours.

Whenever it fails it is failing at the same section of code, where it is checking the expiration date of the various proxies before it gets ready to advertise the credentials to the factory.

Right after startup the DE has about 12000 open file descriptors.

Compare with cmsde01 (1.7.0) which has 9972 overall.

It is not clear if the systemd unit correctly passes its max nofile limit to all the subprocesses of the decision engine.. we should be able to poke around in the process studies and find out more.

StevenCTimm commented 2 years ago

the resource_request channel when just restarted, that process alone, takes about 320 file descriptors when first open, will watch and see if it grows/leaks.

StevenCTimm commented 2 years ago

on fermicloud435 the decision engine has been running for a week now. There are currently 51979 open file descriptors on process 1328 which is the resource_request channel.

File descriptors show

decisione 1328 decisionengine 643u REG 253,3 0 18276621 /tmp/tmpwyknyx0r (deleted) decisione 1328 decisionengine 644u REG 253,3 0 18276622 /tmp/tmpmrrj160h (deleted) decisione 1328 decisionengine 645u REG 253,3 0 18276623 /tmp/tmpxeiom4eg (deleted) decisione 1328 decisionengine 646u REG 253,3 0 18276626 /tmp/tmpyhei8kml (deleted) decisione 1328 decisionengine *647u REG 253,3 0 18276627 /tmp/tmpdm7hja62 (deleted)

These are all deleted files and most of them appear to being generated by glideinwms/security.py in decisionengine_modules namely the part which is checking the expiration of all the proxy files.

That is always where the code crashes.

The number of open descriptors is increasing by several per minutes.

StevenCTimm commented 2 years ago

as many 10 file descriptors per minute, The resource_request channel was restarted 1 week ago today, it has gone from 320 file descriptors to 52026 and is adding as many as 10 file descriptors per minute, but not every minute.

decisionengine_modules/glideinwms [root@fermicloud435 glideinwms]# while true ; do lsof -p 1328 | wc -l ; sleep 60 ; done 52016 52026 52028 52037 52043

mambelli commented 2 years ago

Temporary files were not closed when the credential generation was failing. Fixed