cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Issue/Question: HTTP 500 errors logged in `job.err` when running a suite under an Active Directory account on RHEL #4492


andrew-kelleher commented 3 years ago

Describe the bug I'm currently testing Cylc on a RHEL 8.1 virtual machine hosted in Azure.

I've installed Cylc 7.9.5, and have successfully submitted an example "Hello world" suite from the VM itself. All of the logs and outputs show that it ran successfully.

This works for any of the local accounts on the VM.

The VM is also joined to Active Directory using Realm. The issue I'm having is that when running the same suite from the VM using an AD account, I see the following errors in `job.err` -

2021-11-01T12:29:20Z WARNING - Message send failed, try 1 of 7: Cannot connect: http://RHELAK01-vm:43081/put_messages: HTTP Error 500: Internal Server Error

This error is repeated 7 times, every 30 seconds. Running `cylc scan` returns nothing, and the suite hangs until I delete the contact file.
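(For reference, the manual cleanup is just removing the suite's contact file; the path below assumes the Cylc 7 cylc-run layout and a suite registered as `mysuite`:)

```
# The contact file records the scheduler's host/port; removing it
# lets a hung Cylc 7 suite be cleaned up manually.
rm ~/cylc-run/mysuite/.service/contact
```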

Despite the errors, the suite has run successfully and job.out is showing the expected output.

Release version(s) and/or repository branch(es) affected? Same behaviour has been seen on Cylc 7.9.5 and 7.9.2, and RHEL 8.1 and 8.2.

Steps to reproduce the bug

  1. Deploy a RHEL VM
  2. Join the VM to an Active Directory domain using Realm
  3. Install Cylc
  4. Log in to the VM as an AD user and create the ~/.cylc/global.rc file
  5. Create a test suite.rc file (a minimal example is sketched below) and register it, e.g. cylc register mysuite /tut/suite.rc
  6. Run the suite, e.g. cylc run mysuite --host=localhost
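For reference, a minimal Cylc 7 "Hello world" suite.rc along these lines would do (the exact suite used for this report isn't shown, so this is only an assumption):

```
# Minimal Cylc 7 suite: one task printing a greeting.
[scheduling]
    [[dependencies]]
        graph = "hello"
[runtime]
    [[hello]]
        script = echo "Hello World"; sleep 30
```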

Expected behavior

  1. Running cylc scan should show the currently running suite
  2. The job.err file should not show HTTP 500 errors
  3. The suite should finish gracefully and not require the contact file to be deleted

Additional context Troubleshooting completed so far includes disabling the firewall.

As far as I can tell there's no difference in configuration between the local VM accounts and the AD accounts; they're members of the same local groups etc.
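(For example, one quick way to compare the two account types; these commands are illustrative, not taken from the original report:)

```
# Compare UID, GID and group membership of a local and an AD account
id buildadmin
id 'andrew@domain.local'
```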

This may not be a bug per se, but are there any known compatibility issues when submitting a job as an AD user? Or additional configuration required?

hjoliver commented 3 years ago

Hi @andrew-kelleher - thanks for posting this report!

I'm pretty sure this is not a Cylc bug. Evidently job status messaging, which goes over http(s) in Cylc 7, is being blocked for AD accounts but not local accounts. I don't know if AD users can be subject to different firewall rules (but you say you've disabled the firewall anyway).

Can you try to reproduce this without Cylc? Start a web server of some kind on the same port, and try to connect to it from both account types.
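For example (a minimal sketch, assuming Python 3 is available; the port is the one from the error message above):

```
# Stand in for the Cylc server on the same port, then connect
# from both a local account and an AD account.
python3 -m http.server 43081 &
curl -v http://RHELAK01-vm:43081/
```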

I can at least explain the Cylc server (scheduler program) behaviour under this circumstance (blocked job status messaging): with no status messages arriving, the task is never seen to start or succeed, so it sits in the submitted state and the suite hangs rather than shutting down.

hjoliver commented 3 years ago

I just noticed your error message was 500: Internal Server Error, which does suggest the suite server is to blame. But it's not a connection denied, e.g. from the client (the job, in this case) not being able to read the auth file; for that you get a more informative 403 in the job.err. It might be worth running in debug mode and checking the suite server (not job) log.
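(A sketch of that check, assuming the suite is registered as `mysuite`; the log path follows the Cylc 7 cylc-run layout visible in the logs below:)

```
# Re-run with debug logging, then inspect the scheduler's own log
cylc run --debug mysuite
cat ~/cylc-run/mysuite/log/suite/log
```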

> I don't know if AD users can be subject to different firewall rules

And... apparently that's not possible.

andrew-kelleher commented 3 years ago

Hi @hjoliver many thanks for the response.

On this VM I was able to successfully connect to a local instance of Nginx using both a local and an AD account, so I don't believe it's network related.

Interestingly, if I attempt to run cylc stop --now s1 using an AD account I receive a similar 500 error -

Cannot connect: http://RHELAK01-vm:43094/stop_now: HTTP Error 500: Internal Server Error

The suite log file when run as an AD user looks like this. It gets stuck performing health checks -

[andrew@domain.local@RHELAK01-vm suite]$ cat log.20211108T162550Z 
2021-11-08T16:25:50Z DEBUG - Reading file /shared/home/andrew@domain.local/tut/suite.rc
2021-11-08T16:25:50Z DEBUG - Processed configuration dumped: /shared/home/andrew@domain.local/cylc-run/s1/suite.rc.processed
2021-11-08T16:25:50Z DEBUG - Expanding [runtime] namespace lists and parameters
2021-11-08T16:25:50Z DEBUG - Parsing the runtime namespace hierarchy
2021-11-08T16:25:50Z DEBUG - Parsing [special tasks]
2021-11-08T16:25:50Z DEBUG - Parsing the dependency graph
2021-11-08T16:25:50Z DEBUG - Configuring internal queues
2021-11-08T16:25:50Z INFO - Suite server: url=http://RHELAK01-vm:43034/ pid=1012178
2021-11-08T16:25:50Z INFO - Run: (re)start=0 log=1
2021-11-08T16:25:50Z INFO - Cylc version: 7.9.2
2021-11-08T16:25:50Z INFO - Run mode: live
2021-11-08T16:25:50Z INFO - Initial point: 1
2021-11-08T16:25:50Z INFO - Final point: 1
2021-11-08T16:25:50Z INFO - Cold Start 1
2021-11-08T16:25:50Z DEBUG - [hello.1] -released to the task pool
2021-11-08T16:25:50Z DEBUG - BEGIN TASK PROCESSING
2021-11-08T16:25:50Z DEBUG - [hello.1] -waiting => queued
2021-11-08T16:25:50Z DEBUG - 1 task(s) de-queued
2021-11-08T16:25:50Z INFO - [hello.1] -submit-num=01, owner@host=RHELAK01-vm
2021-11-08T16:25:50Z DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/shared/home/andrew@domain.local/cylc-run/s1/log/job'] ... # will invoke in batches, sizes=[1]
2021-11-08T16:25:50Z DEBUG - [hello.1] -queued => ready
2021-11-08T16:25:50Z DEBUG - END TASK PROCESSING (took 0.0403678417206 seconds)
2021-11-08T16:25:50Z DEBUG - Performing suite health check
2021-11-08T16:25:50Z DEBUG - Loading site/user global config files
2021-11-08T16:25:50Z DEBUG - Reading file /shared/home/andrew@domain.local/.cylc/global.rc
2021-11-08T16:25:51Z DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/shared/home/andrew@domain.local/cylc-run/s1/log/job', '1/hello/01']
2021-11-08T16:25:51Z DEBUG - Performing suite health check
2021-11-08T16:25:51Z DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /shared/home/andrew@domain.local/cylc-run/s1/log/job 1/hello/01
[jobs-submit ret_code] 0
[jobs-submit out]
[TASK JOB SUMMARY]2021-11-08T16:25:51Z|1/hello/01|0|1012201
[TASK JOB COMMAND]2021-11-08T16:25:51Z|1/hello/01|[STDOUT] 1012201
2021-11-08T16:25:51Z INFO - [hello.1] status=ready: (internal)submitted at 2021-11-08T16:25:51Z for job(01)
2021-11-08T16:25:51Z DEBUG - [hello.1] -ready => submitted
2021-11-08T16:25:51Z INFO - [hello.1] -health check settings: submission timeout=None
2021-11-08T16:25:51Z DEBUG - BEGIN TASK PROCESSING
2021-11-08T16:25:51Z DEBUG - 0 task(s) de-queued
2021-11-08T16:25:51Z DEBUG - [hello.1] -forced spawning
2021-11-08T16:25:51Z DEBUG - END TASK PROCESSING (took 0.00322413444519 seconds)
2021-11-08T16:25:51Z DEBUG - Performing suite health check
2021-11-08T16:25:52Z DEBUG - Performing suite health check
2021-11-08T16:25:53Z DEBUG - Performing suite health check
2021-11-08T16:25:54Z DEBUG - Performing suite health check
<snip>

But for a local account it looks like this and finishes cleanly -

[buildadmin@RHELAK01-vm suite]$ cat log.20211108T162206Z 
2021-11-08T16:22:06Z DEBUG - Reading file /tut/suite.rc
2021-11-08T16:22:06Z DEBUG - Processed configuration dumped: /home/buildadmin/cylc-run/s1/suite.rc.processed
2021-11-08T16:22:06Z DEBUG - Expanding [runtime] namespace lists and parameters
2021-11-08T16:22:06Z DEBUG - Parsing the runtime namespace hierarchy
2021-11-08T16:22:06Z DEBUG - Parsing [special tasks]
2021-11-08T16:22:06Z DEBUG - Parsing the dependency graph
2021-11-08T16:22:06Z DEBUG - Configuring internal queues
2021-11-08T16:22:07Z INFO - Suite server: url=http://RHELAK01-vm:43058/ pid=1011695
2021-11-08T16:22:07Z INFO - Run: (re)start=0 log=1
2021-11-08T16:22:07Z INFO - Cylc version: 7.9.2
2021-11-08T16:22:07Z INFO - Run mode: live
2021-11-08T16:22:07Z INFO - Initial point: 1
2021-11-08T16:22:07Z INFO - Final point: 1
2021-11-08T16:22:07Z INFO - Cold Start 1
2021-11-08T16:22:07Z DEBUG - [hello.1] -released to the task pool
2021-11-08T16:22:07Z DEBUG - BEGIN TASK PROCESSING
2021-11-08T16:22:07Z DEBUG - [hello.1] -waiting => queued
2021-11-08T16:22:07Z DEBUG - 1 task(s) de-queued
2021-11-08T16:22:07Z INFO - [hello.1] -submit-num=01, owner@host=RHELAK01-vm
2021-11-08T16:22:07Z DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/buildadmin/cylc-run/s1/log/job'] ... # will invoke in batches, sizes=[1]
2021-11-08T16:22:07Z DEBUG - [hello.1] -queued => ready
2021-11-08T16:22:07Z DEBUG - END TASK PROCESSING (took 0.00897312164307 seconds)
2021-11-08T16:22:07Z DEBUG - Performing suite health check
2021-11-08T16:22:07Z DEBUG - Loading site/user global config files
2021-11-08T16:22:07Z DEBUG - Reading file /home/buildadmin/.cylc/global.rc
2021-11-08T16:22:07Z DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/home/buildadmin/cylc-run/s1/log/job', '1/hello/01']
2021-11-08T16:22:07Z DEBUG - Performing suite health check
2021-11-08T16:22:08Z DEBUG - [client-connect] buildadmin@RHELAK01-vm:cylc-message privilege='full-control' a33b5f20-ed9f-487b-a99f-6ed1cb0b9a8e
2021-11-08T16:22:08Z INFO - [client-command] put_messages buildadmin@RHELAK01-vm:cylc-message a33b5f20-ed9f-487b-a99f-6ed1cb0b9a8e
2021-11-08T16:22:08Z DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /home/buildadmin/cylc-run/s1/log/job 1/hello/01
[jobs-submit ret_code] 0
[jobs-submit out]
[TASK JOB SUMMARY]2021-11-08T16:22:07Z|1/hello/01|0|1011718
[TASK JOB COMMAND]2021-11-08T16:22:07Z|1/hello/01|[STDOUT] 1011718
2021-11-08T16:22:08Z INFO - [hello.1] status=ready: (internal)submitted at 2021-11-08T16:22:07Z for job(01)
2021-11-08T16:22:08Z DEBUG - [hello.1] -ready => submitted
2021-11-08T16:22:08Z INFO - [hello.1] -health check settings: submission timeout=None
2021-11-08T16:22:08Z DEBUG - BEGIN TASK PROCESSING
2021-11-08T16:22:08Z DEBUG - 0 task(s) de-queued
2021-11-08T16:22:08Z DEBUG - [hello.1] -forced spawning
2021-11-08T16:22:08Z DEBUG - END TASK PROCESSING (took 0.000383138656616 seconds)
2021-11-08T16:22:08Z INFO - [hello.1] status=submitted: (received)started at 2021-11-08T16:22:08Z for job(01)
2021-11-08T16:22:08Z DEBUG - [hello.1] -submitted => running
2021-11-08T16:22:08Z INFO - [hello.1] -health check settings: execution timeout=None
2021-11-08T16:22:08Z DEBUG - Performing suite health check
2021-11-08T16:22:09Z DEBUG - BEGIN TASK PROCESSING
2021-11-08T16:22:09Z DEBUG - 0 task(s) de-queued
2021-11-08T16:22:09Z DEBUG - END TASK PROCESSING (took 0.000272035598755 seconds)
2021-11-08T16:22:09Z DEBUG - Performing suite health check
2021-11-08T16:22:10Z DEBUG - Performing suite health check
2021-11-08T16:22:11Z DEBUG - Performing suite health check
<snip>
2021-11-08T16:22:12Z DEBUG - Performing suite health check
2021-11-08T16:22:35Z DEBUG - Performing suite health check
2021-11-08T16:22:36Z DEBUG - Performing suite health check
2021-11-08T16:22:37Z DEBUG - Performing suite health check
2021-11-08T16:22:38Z DEBUG - [client-connect] buildadmin@RHELAK01-vm:cylc-message privilege='full-control' 92315667-d0f5-4a0c-bb41-e433039a126f
2021-11-08T16:22:38Z INFO - [client-command] put_messages buildadmin@RHELAK01-vm:cylc-message 92315667-d0f5-4a0c-bb41-e433039a126f
2021-11-08T16:22:38Z INFO - [hello.1] status=running: (received)succeeded at 2021-11-08T16:22:38Z for job(01)
2021-11-08T16:22:38Z DEBUG - [hello.1] -running => succeeded
2021-11-08T16:22:38Z INFO - Suite shutting down - AUTOMATIC
2021-11-08T16:22:38Z INFO - DONE

I thought it might be an issue with the original RHEL VM. However, I've since deployed multiple RHEL VMs with the same config and they all have the same problem. It's certainly a very peculiar issue!

Updated with full debug logs

hjoliver commented 3 years ago

That is strange. There's no sign of an internal error in the server log. I'll see if anyone on the team has access to RHEL with AD...

dwsutherland commented 3 years ago

Are you able to get results with:

$ curl -v -s -u cylc:$(cat ~/cylc-run/s1/.service/passphrase) --anyauth --insecure 'https://RHELAK01-vm:43094/get_cylc_version'

and/or

$ curl -v -s -u cylc:$(cat ~/cylc-run/s1/.service/passphrase) --anyauth --insecure 'https://RHELAK01-vm:43094/get_latest_state'

from where the suite is running? (I guess it would be the same 500 message)

andrew-kelleher commented 3 years ago

Hi @dwsutherland - that actually works when logged into the RHEL VM using my AD creds.

My global.rc is configured with method = http, so I had to swap http into the URLs.
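For example, the version check became something like this (the suite name s3 and port are taken from the run below, so treat the exact values as assumptions):

```
curl -v -s -u cylc:$(cat ~/cylc-run/s3/.service/passphrase) --anyauth 'http://RHELAK01-vm:43010/get_cylc_version'
```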

These are the responses when running a new suite, "s3". Getting the version -

*   Trying fe80::222:48ff:fe43:647e...
* TCP_NODELAY set
* connect to fe80::222:48ff:fe43:647e port 43010 failed: Connection refused
*   Trying 10.100.0.14...
* TCP_NODELAY set
* Connected to RHELAK01-vm (10.100.0.14) port 43010 (#0)
> GET /get_cylc_version HTTP/1.1
> Host: RHELAK01-vm:43010
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 401 Unauthorized
< Date: Tue, 09 Nov 2021 16:55:22 GMT
< Content-Length: 1066
< Content-Type: text/html;charset=utf-8
< Www-Authenticate: Digest realm="s3", nonce="1636476922:f9a2716a58082b9cf343278f0e9626d6", algorithm="MD5", qop="auth"
< Server: CherryPy/unknown
< 
* Ignoring the response-body
* Connection #0 to host RHELAK01-vm left intact
* Issue another request to this URL: 'http://RHELAK01-vm:43010/get_cylc_version'
* Found bundle for host RHELAK01-vm: 0x558428e5f660 [can pipeline]
* Re-using existing connection! (#0) with host RHELAK01-vm
* Connected to RHELAK01-vm (10.100.0.14) port 43010 (#0)
* Server auth using Digest with user 'cylc'
> GET /get_cylc_version HTTP/1.1
> Host: RHELAK01-vm:43010
> Authorization: Digest username="cylc", realm="s3", nonce="1636476922:f9a2716a58082b9cf343278f0e9626d6", uri="/get_cylc_version", cnonce="OGY1YjIwMmMwNmZmNjViZGFlNDFlNjU2MDY0MmIzNmM=", nc=00000001, qop=auth, response="71b2b2f2be56da28bfa66045ecb35a06", algorithm="MD5"
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Tue, 09 Nov 2021 16:55:22 GMT
< Content-Length: 7
< Content-Type: application/json
< Server: CherryPy/unknown
< 
* Connection #0 to host RHELAK01-vm left intact

and getting the state -

*   Trying fe80::222:48ff:fe43:647e...
* TCP_NODELAY set
* connect to fe80::222:48ff:fe43:647e port 43010 failed: Connection refused
*   Trying 10.100.0.14...
* TCP_NODELAY set
* Connected to RHELAK01-vm (10.100.0.14) port 43010 (#0)
> GET /get_latest_state HTTP/1.1
> Host: RHELAK01-vm:43010
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 401 Unauthorized
< Date: Tue, 09 Nov 2021 16:52:46 GMT
< Content-Length: 1066
< Content-Type: text/html;charset=utf-8
< Www-Authenticate: Digest realm="s3", nonce="1636476766:9b5d6d73a29943c0ddbad9f456448af5", algorithm="MD5", qop="auth"
< Server: CherryPy/unknown
< 
* Ignoring the response-body
* Connection #0 to host RHELAK01-vm left intact
* Issue another request to this URL: 'http://RHELAK01-vm:43010/get_latest_state'
* Found bundle for host RHELAK01-vm: 0x555ae28bb660 [can pipeline]
* Re-using existing connection! (#0) with host RHELAK01-vm
* Connected to RHELAK01-vm (10.100.0.14) port 43010 (#0)
* Server auth using Digest with user 'cylc'
> GET /get_latest_state HTTP/1.1
> Host: RHELAK01-vm:43010
> Authorization: Digest username="cylc", realm="s3", nonce="1636476766:9b5d6d73a29943c0ddbad9f456448af5", uri="/get_latest_state", cnonce="ZDFjNDQ3MWVhNTQ3ZTQ3YmU2OTEyNTdjNzA2NjM0YTQ=", nc=00000001, qop=auth, response="f60cfe944b586409ec7947e17a6c6669", algorithm="MD5"
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Tue, 09 Nov 2021 16:52:46 GMT
< Content-Length: 1408
< Content-Type: application/json
< Server: CherryPy/unknown
< 
{"ancestors": {"root": ["root"], "hello": ["hello", "root"]}, "descendants": {"root": ["hello"]}, "err_content": "", "ancestors_pruned"
: {"hello": ["hello", "root"]}, "summary": [{"reloading": false, "time zone info": {"hours": 0, "string_basic": "Z", "string_extended":
 "Z", "minutes": 0}, "last_updated": 1636476130.156971, "newest runahead cycle point string": null, "status_string": "running to stop a
t 1", "suite_urls": {"suite": "", "root": "", "hello": ""}, "newest cycle point string": "1", "states": ["submitted"], "state totals": 
{"submitted": 1}, "namespace definition order": ["hello", "root"], "oldest cycle point string": "1", "run_mode": "live"}, {"hello.1": {
"started_time_string": null, "execution_time_limit": null, "finished_time": null, "submit_num": 1, "name": "hello", "latest_message": "
submitted", "started_time": null, "title": "", "state": "submitted", "batch_sys_name": "background", "logfiles": [], "host": "localhost
", "submitted_time": 1636476129.0, "mean_elapsed_time": null, "submit_method_id* Connection #0 to host RHELAK01-vm left intact
": "1178567", "finished_time_string": null, "label": "1", "job_hosts": {"1": "localhost"}, "spawned": "True", "submitted_time_string": 
"2021-11-09T16:42:09Z", "description": ""}}, {"root.1": {"state": "submitted", "label": "1", "title": "", "name": "root", "description"
: ""}}], "err_size": 0, "mean_main_loop_interval": 1.0011077165603637, "cylc_version": "7.9.2", "full_mode": true}
hjoliver commented 3 years ago

Huh, it should work the same with curl and cylc, so I'm at a loss to explain this so far. Sorry @andrew-kelleher.

I suppose you've checked that the port number in the "cannot connect" messages is the same one reported at start-up by the cylc scheduler? (It is not the same in your example above, but maybe the job.err and suite log were taken from different runs?)
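(One way to cross-check, with paths assumed from the Cylc 7 cylc-run layout in your logs:)

```
# Port the scheduler reported at start-up
grep 'Suite server: url=' ~/cylc-run/s1/log/suite/log
# Port the job's messaging client tried to reach
grep 'Cannot connect' ~/cylc-run/s1/log/job/1/hello/01/job.err
```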

Another idea: are you in a position to use Cylc 8 instead? We're making the final beta release any day now. Cylc 8 does not use HTTP for job-to-server comms, so if the problem does not occur there, it might not be worth wasting time trying to diagnose the old system.

andrew-kelleher commented 3 years ago

Thanks @hjoliver. You're correct, the different port numbers are due to different runs.

Cylc 8 is on the roadmap for our customer, but not for a while yet unfortunately.

Our customer is running this configuration albeit with older versions of RHEL and Cylc. We'll work with them to understand the differences between their environment and ours. I'll update this issue as/when we make some progress.

dwsutherland commented 3 years ago

> My global.rc is configured with method = http, so I had to swap http into the URLs.

Just out of curiosity, what happens if you remove this section from the global.rc? (Perhaps a proxy is filtering out non-secure traffic, since https is the default.)

Or does the suite complain about not having certificates or something?
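(For reference, the section in question would presumably look something like this in ~/.cylc/global.rc; an assumption based on Cylc 7's [communication] settings, since the actual file isn't shown:)

```
# Cylc 7 global.rc: suite comms method; https is the default
[communication]
    method = http
```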