ManageIQ / manageiq

ManageIQ Open-Source Management Platform
https://manageiq.org
Apache License 2.0
1.35k stars 900 forks source link

[intermittent]: evm service failed to auto start upon miq-nightly-201608052000 appliance boot #10300

Closed amavinag closed 4 years ago

amavinag commented 8 years ago

[intermittent]: evm service failed to auto start upon appliance boot. however, "evmserverd" starts running without problem if started manually. this issue is seen intermittently on virtualcenter and rhevm.

version: miq-nightly-201608052000 anaconda-post log: https://gist.github.com/amavinag/a7a008820188398be13db7366db1d7bd evmlog: https://gist.github.com/amavinag/149c4337e4d64af8096ef2c346ca72b4

carbonin commented 8 years ago

I've seen this before, but was always unable to pin it down. My current theory is that the way systemd is ordering the jobs may be not working correctly in some cases.

The next time this comes up run systemctl list-jobs on the appliance and put the output here. That will tell us if there are any current systemd unit tasks running (if I'm right I would expect the evmserverd start task to be shown as queued there).

/cc @simaishi

amavinag commented 8 years ago

@carbonin: I see this problem again with "miq-nightly-201608152000". no jobs running in "systemctl list-jobs"

[root@~]# systemctl list-jobs No jobs running.

[root@ ~]# systemctl | grep evm evm-watchdog.service loaded active running evm watchdog process

[root@ ~]# tail -f /var/www/miq/vmdb/log/evm.log [----] I, [2016-08-16T11:38:00.631226 #8511:b6597c] INFO -- : MIQ(SessionStore) Using session_store: ActionDispatch::Session::MemCacheStore [----] I, [2016-08-16T11:38:01.230512 #8511:b6597c] INFO -- : MIQ(Vmdb::Initializer.init) - Program Name: bin/rails, PID: 8511, ENV['MIQ_GUID']: , ENV['EVMSERVER']: [----] I, [2016-08-16T11:38:10.832402 #10473:12c3980] INFO -- : MIQ(Vmdb::Loggers.apply_config) Log level for vim.log has been changed to [WARN] [----] I, [2016-08-16T11:38:11.495127 #10473:12c3980] INFO -- : MIQ(SessionStore) Using session_store: ActionDispatch::Session::MemCacheStore [----] I, [2016-08-16T11:38:11.666478 #10473:12c3980] INFO -- : MIQ(Vmdb::Initializer.init) - Program Name: /opt/rubies/ruby-2.2.5/bin/rake, PID: 10473, ENV['MIQ_GUID']: , ENV['EVMSERVER']: [----] I, [2016-08-16T11:38:41.942822 #11927:c5b980] INFO -- : MIQ(Vmdb::Loggers.apply_config) Log level for vim.log has been changed to [WARN] [----] I, [2016-08-16T11:38:42.605064 #11927:c5b980] INFO -- : MIQ(SessionStore) Using session_store: ActionDispatch::Session::MemCacheStore [----] I, [2016-08-16T11:38:42.759884 #11927:c5b980] INFO -- : MIQ(Vmdb::Initializer.init) - Program Name: bin/rails, PID: 11927, ENV['MIQ_GUID']: , ENV['EVMSERVER']: [----] I, [2016-08-16T11:38:44.737053 #11927:c5b980] INFO -- : MIQ(MiqDatabase.seed) Creating MiqDatabase record [----] I, [2016-08-16T11:38:44.945779 #11927:c5b980] INFO -- : MIQ(MiqRegion.seed) Creating Region [0]

amavinag commented 8 years ago

we are observing this issue more often in recent nightly builds. is there any update on this issue? /cc @simaishi @dajohnso

carbonin commented 8 years ago

I opened a PR that will hopefully give us some additional information around the spot where we start and enable the evmserverd service.

If this happens again, check the appliance_console.log. It may have more detail.

simaishi commented 7 years ago

I just hit this issue and I don't see anything different in appliance_console.log...

simaishi commented 7 years ago

@carbonin other people seem to be hitting this issue. Since we're not sure what's causing this to happen, is it possible to add a retry or something if it's enabled but not started?

simaishi commented 7 years ago

Encountered the issue again and seeing a bit more details in appliance_console.log this time...

E, [2017-06-15T12:16:04.881923 #11972] ERROR -- : Failed to enable and start evmserverd service: /bin/systemctl exit code:
E, [2017-06-15T12:16:04.882098 #11972] ERROR -- : /opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/gems/linux_admin-0.20.1/lib/linux_admin/common.rb:24:in `run!'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/gems/linux_admin-0.20.1/lib/linux_admin/service/systemd_service.rb:18:in `start'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/database_configuration.rb:234:in `block in start_evm'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/database_configuration.rb:232:in `fork'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/database_configuration.rb:232:in `start_evm'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/internal_database_configuration.rb:101:in `post_activation'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/cli.rb:192:in `set_internal_db'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/cli.rb:165:in `set_db'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/cli.rb:148:in `run'
/opt/rubies/ruby-2.3.1/lib/ruby/gems/2.3.0/bundler/gems/manageiq-gems-pending-22f8dd86d6e8/lib/gems/pending/appliance_console/cli.rb:311:in `parse'
/usr/bin/appliance_console_cli:11:in `<main>'
miq-bot commented 6 years ago

This issue has been automatically marked as stale because it has not been updated for at least 6 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions!

simaishi commented 5 years ago

@carbonin This is still happening... anything else we can do to troubleshoot?

carbonin commented 5 years ago

I haven't had time to look into this, but I've also seen it. I added the pinned label so it won't get marked as stale again.

In the times I've seen it, I've still been able to just start up evmserverd without trouble, but didn't see anything that could have caused it. It really seems like a timing issue.

fbladilo commented 5 years ago

@carbonin @simaishi Happening to me consistently, installing on RHV using Ovirt.Manageiq role. So far 5 out of 5 failed, I'm able to start it manually right afterwards. Let me know if I can help on debugging or feedback. ApplianceConsole is taking ~25 seconds to initialize, that should be good enough to allow things to settle, the failure comes 7 seconds later but without an exit code from systemctl it's hard to say why it failed.

I, [2018-12-06T09:59:07.372162 #4849]  INFO -- : MIQ(ManageIQ::ApplianceConsole::InternalDatabaseConfiguration#initialize_postgresql) : starting
I, [2018-12-06T09:59:19.736010 #4849]  INFO -- : MIQ(ManageIQ::ApplianceConsole::InternalDatabaseConfiguration#initialize_postgresql) : complete
I, [2018-12-06T09:59:32.833896 #4849]  INFO -- : MIQ(ManageIQ::ApplianceConsole::InternalDatabaseConfiguration#create_region) : starting
I, [2018-12-06T10:00:03.476615 #4849]  INFO -- : MIQ(ManageIQ::ApplianceConsole::InternalDatabaseConfiguration#create_region) : complete
E, [2018-12-06T10:00:10.163587 #15587] ERROR -- : Failed to enable and start evmserverd service: /bin/systemctl exit code: 
E, [2018-12-06T10:00:10.163879 #15587] ERROR -- : /usr/local/lib/ruby/gems/2.4.0/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!'
/usr/local/lib/ruby/gems/2.4.0/gems/linux_admin-1.2.2/lib/linux_admin/common.rb:24:in `run!'
/usr/local/lib/ruby/gems/2.4.0/gems/linux_admin-1.2.2/lib/linux_admin/service/systemd_service.rb:18:in `start'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/database_configuration.rb:237:in `block in start_evm'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/database_configuration.rb:235:in `fork'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/database_configuration.rb:235:in `start_evm'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/internal_database_configuration.rb:108:in `post_activation'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/cli.rb:225:in `set_internal_db'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/cli.rb:192:in `set_db'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/cli.rb:170:in `run'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/lib/manageiq/appliance_console/cli.rb:412:in `parse'
/usr/local/lib/ruby/gems/2.4.0/gems/manageiq-appliance_console-3.3.0/bin/appliance_console_cli:7:in `<top (required)>'
/usr/local/bin/appliance_console_cli:23:in `load'
/usr/local/bin/appliance_console_cli:23:in `<main>'
bthurber commented 5 years ago

I can confirm this issue persists with downstream CF builds as well. Upon every reboot I must start EVM server.

carbonin commented 5 years ago

That is very different @bthurber.

In this issue the service is getting enabled, but not started on first boot. This is the expected behavior for ManageIQ, but not for CloudForms.

In CloudForms the user is expected to configure a database (either create or join a region) for each new appliance, but after that the service is expected to be started and enabled. It sounds like your service was started on first configuration, but not enabled which is the opposite of this issue, where the service is enabled, but not started.

simaishi commented 5 years ago

@gtanzillo We continue seeing this problem, is there anything we can do?

Sometime I see Failed to enable and start evmserverd service: /bin/systemctl exit code: (without an actual exit code) in appliance_console.log and other times I don't even see that message. In both cases I can manually start evmserverd and things work after that.

chessbyte commented 5 years ago

@simaishi does this happen with the systemd integration that @agrare recently implemented?

agrare commented 5 years ago

Anything is possible but most likely not, the systemd work went in in May '19 most of these reports are from before that and the systemd ruby gems aren't installed by default.

simaishi commented 5 years ago

I'm not sure exactly what changes with the systemd work that was done... Would it change how we start evmserverd? Based on the logs I saw, it looked like evmserverd service was enabled but not started at all.

agrare commented 5 years ago

@simaishi short summary, we can configurably use systemd to start and manage miq_worker processes rather than miq_server/worker_management. There is a config setting for this and the main evmserverd.service file is not impacted

simaishi commented 5 years ago

@carbonin I was able to reproduce this issue using ivanchuk-1 vagrant image and can confirm calling systemctl enable with --now option in LinuxAdmin fixes the issue (rather than calling enable.start).

Can you please take a look and fix this properly? (My "fix" works only for testing as I just duplicated enable method and added --now option 😄)

Thanks to the chat we had this morning... that gave me a hint.

simaishi commented 4 years ago

I deployed an appliance which has manageiq-appliance_console 5.3.2. For the first time, I actually see a message printed for the exit code:

E, [2020-03-10T08:44:03.055517 #6779] ERROR -- : Failed to enable and start evmserverd service: /bin/systemctl exit code: error was: Created symlink /etc/systemd/system/multi-user.target.wants/evmserverd.service → /usr/lib/systemd/system/evmserverd.service.

evmserverd started fine, though. Looks like that message was treated as error and that was preventing start to be called after enable before, but since enable and start are done in 1 command now, it works?

carbonin commented 4 years ago

@simaishi Is the service enabled?

simaishi commented 4 years ago

Yes, enabled and started.

simaishi commented 4 years ago

Adding --quiet to systemctl command might get rid of that error?

jrafanie commented 4 years ago

I wonder if it's writing that output to standard output or exiting non-zero.

jrafanie commented 4 years ago

Sorry, standard error