ManageIQ / manageiq

ManageIQ Open-Source Management Platform
https://manageiq.org
Apache License 2.0

Cockpit fails to launch in pods #20829

Open jrafanie opened 3 years ago

jrafanie commented 3 years ago

I'm not sure if this works in appliances, but in pods it still fails after #20827 and #20823. Opening the issue here since the problematic code is in core.

{"@timestamp":"2020-11-18T22:51:15.378843 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"info","message":"MiqCockpitWsWorker::Runner started. ID [24], PID [1293], GUID [88d8560d-9edf-4a85-8fe5-758e78ee4de4], Zone [default], Role [automate,cockpit_ws,database_operations,database_owner,ems_inventory,ems_operations,event,remote_console,reporting,scheduler,smartstate,user_interface,web_services]"}
[----] I, [2020-11-18T22:51:15.379001 #1293:2ab9f2e63968]  INFO -- : MiqCockpitWsWorker::Runner started. ID [24], PID [1293], GUID [88d8560d-9edf-4a85-8fe5-758e78ee4de4], Zone [default], Role [automate,cockpit_ws,database_operations,database_owner,ems_inventory,ems_operations,event,remote_console,reporting,scheduler,smartstate,user_interface,web_services]
[----] I, [2020-11-18T22:51:15.427438 #1293:2ab9f2e63968]  INFO -- : MIQ(MiqCockpitWsWorker::Runner#stop_drb_service) MIQ(MiqCockpitWsWorker::Runner) stopped drb Process at
[----] I, [2020-11-18T22:51:15.437525 #1293:2ab9f2e63968]  INFO -- : MIQ(MiqCockpitWsWorker::Runner#start_drb_service) MIQ(MiqCockpitWsWorker::Runner) Started drb Process at drbunix:///tmp/cockpit20201118-1293-gp32ay
[----] I, [2020-11-18T22:51:15.437711 #1293:2ab9f2e63968]  INFO -- : MIQ(MiqCockpitWsWorker::Runner#start_cockpit_ws) MIQ(MiqCockpitWsWorker::Runner) Starting cockpit-ws Process
[----] I, [2020-11-18T22:51:15.437848 #1293:2ab9f2e63968]  INFO -- : MIQ(MiqCockpitWsWorker::Runner#cockpit_ws_run) MIQ(MiqCockpitWsWorker::Runner) cockpit-ws process starting
{"@timestamp":"2020-11-18T22:51:15.427282 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"info","message":"MIQ(MiqCockpitWsWorker::Runner#stop_drb_service) MIQ(MiqCockpitWsWorker::Runner) stopped drb Process at "}
{"@timestamp":"2020-11-18T22:51:15.437346 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"info","message":"MIQ(MiqCockpitWsWorker::Runner#start_drb_service) MIQ(MiqCockpitWsWorker::Runner) Started drb Process at drbunix:///tmp/cockpit20201118-1293-gp32ay"}
{"@timestamp":"2020-11-18T22:51:15.437627 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"info","message":"MIQ(MiqCockpitWsWorker::Runner#start_cockpit_ws) MIQ(MiqCockpitWsWorker::Runner) Starting cockpit-ws Process"}
{"@timestamp":"2020-11-18T22:51:15.437763 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"info","message":"MIQ(MiqCockpitWsWorker::Runner#cockpit_ws_run) MIQ(MiqCockpitWsWorker::Runner) cockpit-ws process starting"}
[----] E, [2020-11-18T22:51:15.448402 #1293:2ab9f2e63968] ERROR -- : AwesomeSpawn: which exit code: 1
[----] E, [2020-11-18T22:51:15.448608 #1293:2ab9f2e63968] ERROR -- : AwesomeSpawn: which: no apachectl in (/opt/manageiq/manageiq-gemset/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

{"@timestamp":"2020-11-18T22:51:15.448188 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"err","message":"AwesomeSpawn: which exit code: 1"}
{"@timestamp":"2020-11-18T22:51:15.448510 ","hostname":"orchestrator-6b56cdc5fd-plh9j","pid":1293,"tid":"2ab9f2e63968","level":"err","message":"AwesomeSpawn: which: no apachectl in (/opt/manageiq/manageiq-gemset/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)\n"}

I think it's failing in code that's expecting access to apachectl or apache in general: https://github.com/ManageIQ/manageiq/blob/90c232339e2e313f476a82aec4e43ff74f1c3649/lib/miq_cockpit.rb#L46
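For context, the failing check is essentially a PATH lookup. A minimal Ruby sketch of what `which apachectl` does (the helper name is made up for illustration; this is not the actual ManageIQ code):

```ruby
# Minimal sketch of the PATH search that `which` performs. The pod image's
# PATH contains no apachectl, so the lookup comes up empty and `which`
# exits with status 1, which AwesomeSpawn then reports as an error.
def find_in_path(cmd)
  ENV["PATH"].to_s.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, cmd)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil # corresponds to `which` exiting with status 1
end

find_in_path("apachectl") # nil in the pod image, since apachectl is absent
```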

The monitor thread launches but the runner seems to fail. Thankfully, it doesn't appear to restart constantly.

Related to https://github.com/ManageIQ/manageiq-pods/issues/531 and https://github.com/ManageIQ/manageiq-pods/issues/595

jrafanie commented 3 years ago

Note: see this comment, which shows that cockpit still doesn't work in pods, but at least it doesn't thrash the system anymore.

kbrock commented 3 years ago

Users or administrators should never need to start this program, as it is automatically started by systemd(1) on bootup (ref).

Cockpit will then run various services to administer this server. I need to look a little closer to see if our goal is to run the bridge, which allows us to skip Cockpit's own security. We may have wanted to do this so we could use our own user lookup tables for authorization rather than relying on the OS. I'm not saying this is the way we want to go, just offering a guess as to the intent.

I don't think we want this running on our main server, nor as a local custom service. They do implement single sign-on-type logic, but that is via IPA.

Fryguy commented 3 years ago

Wonder if it's mostly because we never merged https://github.com/ManageIQ/manageiq-pods/issues/97

Fryguy commented 3 years ago

To be more accurate, we don't have apache in the pods either, so that makes sense.

Fryguy commented 3 years ago

So I dug into this with @jrafanie and this is how it works, more or less:

The cockpit integration is more or less like a remote console: cockpit traffic to some other machine (in our case a Vm, Host, or ContainerNode) is proxied through the appliance that has the cockpit role.

When someone turns on the cockpit role, a thread is started (used to be a full blown worker, but now it's just a thread). The thread eventually checks that Apache is available [1], but that's mostly not important anymore because we have the apache config baked into our appliance [2]. In the past that configuration was actually dynamically generated, but not any longer. This is what is currently failing in pods, because which is not available. That seems like a rather simple fix to include which, but it will likely still fail as we move forward because of what else it does.

Eventually it will try to start cockpit-ws as a child process. This tool starts a local webservice on port 9002 that is designed to proxy cockpit traffic to other systems. With the local appliance's apache config set to redirect to localhost:9002 [3], this effectively exposes that webservice through our Apache instance on the appliance.
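The redirect mentioned in [3] is, roughly, an Apache reverse-proxy rule of this shape (a hypothetical sketch for illustration; the path and the actual shipped config may differ):

```apache
# Hypothetical sketch: forward /cws/ requests to the local cockpit-ws
# instance on port 9002, and rewrite response headers on the way back.
ProxyPass        /cws/ http://localhost:9002/cws/
ProxyPassReverse /cws/ http://localhost:9002/cws/
```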

In the ManageIQ UI, a button is tied to the cockpit of a Vm, Node, or ContainerNode by presenting a URL that looks roughly like https://<hostname_of_miq_server_with_cockpit_role>/cws/=<ip_or_hostname_of_remote_server>. That URL goes through that appliance's Apache, redirects to its localhost:9002, and cockpit-ws sends that traffic over to the real server, and then it's all reverse proxied back to the user.
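In other words, the button URL is just string composition over the proxying server and the target. A hedged Ruby sketch of that shape (the helper name is hypothetical, not the actual UI code):

```ruby
# Hypothetical sketch of how the proxied cockpit URL is assembled:
# the MIQ server with the cockpit role is the host, and the remote
# target's address goes after the /cws/= path prefix.
def cockpit_url(miq_server_host, target)
  "https://#{miq_server_host}/cws/=#{target}"
end

cockpit_url("miq.example.com", "10.0.0.5")
# => "https://miq.example.com/cws/=10.0.0.5"
```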

So, overall, the cockpit integration is, IMO, a glorified remote console, just instead of binary console traffic it's cockpit https traffic being proxied through the appliance that has the cockpit role. The rationale for it makes sense and was described in https://github.com/ManageIQ/manageiq/pull/12506

ManageIQ currently links to cockpit by providing a web interface button. This takes users to https://domain.or.ip:9090. This does not work well for many common setups, because:

1) The target server must be reachable by the end user's machine via the browser. This doesn't work when the target servers are not routable from the user's network, or are behind firewalls where port 9090 is not exposed publicly.
2) The target server needs to expose a certificate that the user's browser trusts. This can be problematic, especially when addressing machines directly by IP. Asking users to accept self-signed certificates is not good practice.

I sort of lump this together conceptually with other remote consoles. So, the question is, do we keep it or remove it? If we keep it, how can we do this in podified? In my opinion, since I see it like another remote console, whatever decision we make probably has similar rationales for keeping or removing other remote consoles as well.

If we keep it, I think what we should do is either bake this into the remote console worker instead of the manageiq-orchestrator, where it currently lives as a thread, or expose it as a separate worker that the httpd container can route to. Either way, that would allow us to keep some parity between podified and appliances. We will probably also need to investigate whether cockpit-ws itself can be run inside a container, and more importantly as non-root.

cc @chessbyte @agrare @jrafanie @kbrock

jrafanie commented 3 years ago

This is what is currently failing in pods, because which is not available. That seems like a rather simple fix to include which, but it will likely still fail as we move forward because of what else it does.

Just to clarify, which exists in the podified cockpit worker (a thread in the server's monitor code) but apachectl doesn't:

AwesomeSpawn: which: no apachectl in (/opt/manageiq/manageiq-gemset/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

miq-bot commented 1 year ago

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.