davmac314 / dinit

Service monitoring / "init" system
Apache License 2.0
582 stars 45 forks source link

dinitctl: `dinitctl enable` can get stuck in service activation status report because of missing `SERVICEEVENT` #363

Closed mobin-2008 closed 6 days ago

mobin-2008 commented 1 month ago

Describe the bug This is a really weird one. Sometimes the dinitctl enable get stuck after the "enabled" message.

To Reproduce

  1. Have a minimal service set (including boot and boot.d)
  2. Add a fail-fast service (for example a service with non-exist executable path).
  3. Create a loop with disable and enable:

    #!/bin/sh
    
    SERVICE="srv"
    SOCKET="/tmp/dinitctl"
    
    i=0
    while :; do
      i=$(( i + 1 ))
      echo $i
      dinitctl disable "$SERVICE" --socket-path "$SOCKET"
      dinitctl enable "$SERVICE" --socket-path "$SOCKET"
    done
  4. It will stuck after a random number of tries.

Expected behavior It should not stuck.

Additional context The dinitctl enables a service with this process:

  1. dinitctl loads a service
  2. dinitctl finds a waits-for.d and add the service in that directory
  3. dinitctl starts the service
  4. dinitctl checks the current state of service with SERVICESTATUS command
  5. dinit responds.
  6. if the status of service wasn't something that dinitctl expects, run wait_service_state()
  7. Report the success or failure.

The problem is dinitctl expects a SERVICEEVENT but sometimes it's missing and dinitctl will stuck on read() syscall (int r = rbuffer.fill_to(socknum, 2);).

davmac314 commented 1 month ago

I can't reproduce after over 30000 iterations:

Service 'sshd' has been disabled.
Service 'sshd' has been enabled.
Service 'sshd' started.
37225
Service 'sshd' has been disabled.
Service 'sshd' has been enabled.
Service 'sshd' started.
37226
mobin-2008 commented 1 month ago

I can't reproduce after over 30000 iterations:

Service 'sshd' has been disabled.
Service 'sshd' has been enabled.
Service 'sshd' started.
37225
Service 'sshd' has been disabled.
Service 'sshd' has been enabled.
Service 'sshd' started.
37226

The problem is that your sshd service is not fail-fast, Look at my example:

# "srv" service

type = process
command = not-exist

In this case it get stuck for me and it's really random, one time it will get stuck after ~700 tries, next time it will stuck after second try :/

davmac314 commented 1 month ago

Yes, I was able to reproduce with a service that fails to execute.

mobin-2008 commented 1 month ago

I found out why it happens. Let's take a look at the dinitctl enable process:

  1. Enable the service
  2. Start the service
  3. Send a SERVICESTATUS
  4. wait_for_reply() (Which does ignore and skip any information packet (such as SERVICEEVENT)
  5. Process the status
  6. If the current status is not STARTED, wait_service_state().

This system looks good but there is a race condition: If the SERVICEEVENT is sent before the SERVICESTATUS, We will lost it in the step 4 and in the step 6, We will wait for our lost SERVICEEVENT. strace can confirm this problem:

write(3, "\22\1\0\0\0", 5)              = 5                          <-- SERVICESTATUS Request 
read(3, "d\21\1\0\0\0\2\3\0\0\4\22\0\2\0\0\0", 847) = 17             <-- Interesting SERVICEEVENT (EXECFAILED)
read(3, "d\21\1\0\0\0\1\0\0\0\4\22\0\2\0\0\0", 830) = 17             <-- Another interesting SERVICEEVENT (STOPPED)
read(3, "F\0\0\0\0\4\22\0\2\0\0\0", 813) = 12                        <-- SERVICESTATUS Response
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x4), ...}) = 0
write(1, "Service 'srv' has been enabled.\n", 32Service 'srv' has been enabled.
) = 32
read(3,                                                              <-- dinitctl get stuck in here
mobin-2008 commented 1 month ago

The quick fix for this is to change this: https://github.com/davmac314/dinit/blob/3867cf1766134980d2c3cd6f441276217af498e9/src/dinitctl.cc#L1930-L1934 To this:

        if (enable) {
            if (current != service_state_t::STARTED && target != service_state_t::STOPPED) {
                wait_service_state(socknum, rbuffer, to_handle, to, false /* start */, verbose);
            }
            else {
                std::cerr << "Service Failed to start." << std::endl; // TODO: Show more info about the failure
                return 1;
            }
        }

@davmac314 Is it a good fix? Or we need to re-implement the SERVICEEVENT processing to be race-free?

davmac314 commented 1 month ago

I don't think there is a race in dinit itself, it's just the events can happen in an order that's not what dinitctl is expecting. The fix you proposed might be ok, I need to have a proper look. Maybe it can be refactored a little.