Closed mobin-2008 closed 6 days ago
I can't reproduce after over 30000 iterations:
Service 'sshd' has been disabled.
Service 'sshd' has been enabled.
Service 'sshd' started.
37225
Service 'sshd' has been disabled.
Service 'sshd' has been enabled.
Service 'sshd' started.
37226
The problem is that your sshd service is not fail-fast. Look at my example:
# "srv" service
type = process
command = not-exist
In this case it gets stuck for me, and it's really random: one time it gets stuck after ~700 tries, the next time after the second try :/
Yes, I was able to reproduce with a service that fails to execute.
I found out why it happens. Let's take a look at the dinitctl enable process: it sends a SERVICESTATUS request, reads the response with wait_for_reply() (which ignores and skips any information packet, such as SERVICEEVENT), and, if the service is not yet STARTED, calls wait_service_state().
This system looks good, but there is a race condition: if the SERVICEEVENT is sent before the SERVICESTATUS response, we lose it in wait_for_reply(), and wait_service_state() then waits for our lost SERVICEEVENT. strace can confirm this problem:
write(3, "\22\1\0\0\0", 5) = 5 <-- SERVICESTATUS Request
read(3, "d\21\1\0\0\0\2\3\0\0\4\22\0\2\0\0\0", 847) = 17 <-- Interesting SERVICEEVENT (EXECFAILED)
read(3, "d\21\1\0\0\0\1\0\0\0\4\22\0\2\0\0\0", 830) = 17 <-- Another interesting SERVICEEVENT (STOPPED)
read(3, "F\0\0\0\0\4\22\0\2\0\0\0", 813) = 12 <-- SERVICESTATUS Response
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x4), ...}) = 0
write(1, "Service 'srv' has been enabled.\n", 32Service 'srv' has been enabled.
) = 32
read(3, <-- dinitctl gets stuck here
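The ordering in the trace above can be modelled with a small stand-alone sketch. This is not dinitctl code: Kind, Packet, read_reply, event_available and the two helper functions are hypothetical names, and a std::deque stands in for the control socket. Under those assumptions it shows that a reply reader which drops information packets loses an event that arrived early, while one that keeps them lets the later wait find it.

```cpp
#include <deque>
#include <optional>
#include <vector>

// Hypothetical packet model; this is not dinit's wire format.
enum class Kind { ServiceEvent, StatusReply };

struct Packet {
    Kind kind;
};

// Reads until a status reply arrives. Information packets met on the way
// are dropped unless `saved` is provided, in which case they are kept.
std::optional<Packet> read_reply(std::deque<Packet>& in,
                                 std::vector<Packet>* saved = nullptr) {
    while (!in.empty()) {
        Packet p = in.front();
        in.pop_front();
        if (p.kind == Kind::StatusReply) return p;
        if (saved != nullptr) saved->push_back(p);
    }
    return std::nullopt; // a real client would block in read() here
}

// Stand-in for the later wait: false means a real read() would block.
bool event_available(const std::deque<Packet>& in,
                     const std::vector<Packet>& saved) {
    if (!saved.empty()) return true;
    for (const Packet& p : in) {
        if (p.kind == Kind::ServiceEvent) return true;
    }
    return false;
}

// Same arrival order as in the trace: two events, then the status reply.
std::deque<Packet> trace_order() {
    return { {Kind::ServiceEvent}, {Kind::ServiceEvent}, {Kind::StatusReply} };
}

// Reply reader drops information packets, as described for wait_for_reply().
bool event_survives_when_discarding() {
    std::deque<Packet> in = trace_order();
    std::vector<Packet> saved; // stays empty: nothing is kept
    read_reply(in);
    return event_available(in, saved);
}

// Reply reader keeps the skipped packets for the later wait.
bool event_survives_when_keeping() {
    std::deque<Packet> in = trace_order();
    std::vector<Packet> saved;
    read_reply(in, &saved);
    return event_available(in, saved);
}
```

In this model event_survives_when_discarding() returns false, which corresponds to the blocked read() at the end of the trace, and event_survives_when_keeping() returns true.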
The quick fix for this is to change this: https://github.com/davmac314/dinit/blob/3867cf1766134980d2c3cd6f441276217af498e9/src/dinitctl.cc#L1930-L1934 To this:
if (enable) {
    if (current != service_state_t::STARTED && target != service_state_t::STOPPED) {
        wait_service_state(socknum, rbuffer, to_handle, to, false /* start */, verbose);
    }
    else {
        std::cerr << "Service Failed to start." << std::endl; // TODO: Show more info about the failure
        return 1;
    }
}
@davmac314 Is it a good fix? Or do we need to re-implement the SERVICEEVENT processing to be race-free?
I don't think there is a race in dinit itself, it's just the events can happen in an order that's not what dinitctl is expecting. The fix you proposed might be ok, I need to have a proper look. Maybe it can be refactored a little.
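One way the check might be refactored, sketched in isolation. This is an assumption-laden model, not dinitctl code: enable_action and decide_after_enable are hypothetical names, and the enum is a local stand-in for dinit's service_state_t. It separates the three cases that the current/target pair can represent, so that a service which is already STARTED is not reported as failed (in the proposed snippet as written, the else branch is also taken in that case).

```cpp
// Local stand-in for the state names used in the proposed snippet.
enum class service_state_t { STOPPED, STARTING, STARTED, STOPPING };

// Hypothetical outcome type; not part of dinitctl.
enum class enable_action { already_started, wait_for_start, report_failure };

// Decides what to do after the SERVICESTATUS response has been read.
enable_action decide_after_enable(service_state_t current, service_state_t target) {
    // Already running: nothing to wait for, and nothing failed.
    if (current == service_state_t::STARTED) return enable_action::already_started;
    // Not started and no longer targeted to start: the start attempt is over.
    if (target == service_state_t::STOPPED) return enable_action::report_failure;
    // Otherwise the start is still in progress, so an event is still to come.
    return enable_action::wait_for_start;
}
```

Only the wait_for_start case would go on to call wait_service_state(); the other two return without reading from the socket again.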
Describe the bug
This is a really weird one. Sometimes dinitctl enable gets stuck after the "enabled" message.
To Reproduce
[...] boot and boot.d)
Create a loop with disable and enable:
Expected behavior
It should not get stuck.
Additional context
dinitctl enables a service with this process:
[...] waits-for.d and add the service in that directory
[...] SERVICESTATUS command
[...] wait_service_state()
The problem is that dinitctl expects a SERVICEEVENT, but sometimes it is missing and dinitctl gets stuck on the read() syscall (int r = rbuffer.fill_to(socknum, 2);).