Linutronix / elbe

Embedded Linux Build Environment
https://elbe-rfs.org
GNU General Public License v3.0

Restart ELBE daemon service on failure #338

Closed koberbe-jh closed 1 year ago

koberbe-jh commented 2 years ago

The ELBE daemon service might crash due to internal errors or resource shortages. At the moment, the service does not recover from this state. This change makes the service restart when its execution fails. This is especially helpful when running the ELBE initvm in automated environments such as CI/CD infrastructure.
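
The diff itself is not reproduced in this thread. As an illustration only, a restart-on-failure policy in systemd is usually expressed with `Restart=on-failure` in the `[Service]` section. The sketch below writes such a policy as a drop-in file. The function name `write_restart_dropin`, the unit name in the usage comment, and the `RestartSec=5` delay are assumptions made for this example, not taken from the ELBE packaging.

```shell
# Sketch: write a systemd drop-in that restarts a unit after a failure.
# write_restart_dropin and the unit name used below are hypothetical.
write_restart_dropin() {
   local unit="$1"                          # unit to override
   local base="${2:-/etc/systemd/system}"   # where drop-in directories live
   local dir="$base/$unit.d"

   mkdir -p "$dir"
   cat > "$dir/restart.conf" <<'EOF'
[Service]
# Restart on unclean exits (non-zero exit code, fatal signal, timeout).
Restart=on-failure
# Pause between restarts so that a crash loop does not spin.
RestartSec=5
EOF
   echo "$dir/restart.conf"
}

# Usage inside the initvm, as root (unit name is a placeholder):
#   write_restart_dropin elbe-daemon.service
#   systemctl daemon-reload
```

A drop-in leaves the packaged unit file untouched, so the policy can be tested on one initvm and removed again by deleting the file.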

bgermann commented 2 years ago

Can you please provide a log of the error?

jogness commented 2 years ago

If the elbe-daemon dies, it would probably be better to have the system reboot.

koberbe-jh commented 2 years ago

> Can you please provide a log of the error?

I'm sorry, but log rotation already removed it. I will post it once I see the errors again.

As far as I remember the process was killed because there was a lack of memory. So I increased the memory of the initvm. The 'restart on failure' is meant to be an additional mitigation strategy.
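
Whether a process was killed for lack of memory can usually be checked in the kernel log, as long as the log still covers the time of the crash. A minimal sketch, assuming the initvm uses the systemd journal; the helper name `find_oom_kills` is made up for this example:

```shell
# Sketch: keep only kernel log lines that point to an out-of-memory kill.
# find_oom_kills is a hypothetical helper; it reads log text from stdin.
find_oom_kills() {
   # "|| true" keeps the exit status at 0 when nothing matches.
   grep -iE 'out of memory|oom-kill' || true
}

# Usage inside the initvm:
#   journalctl -k --no-pager | find_oom_kills
```

If the journal is not stored persistently, kernel messages from before the last reboot are not available this way.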

koberbe-jh commented 2 years ago

> If the elbe-daemon dies, it would probably be better to have the system reboot.

That's what I do manually. Would you trigger a reboot once a connection attempt to the initvm fails?
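
A check of this kind could be sketched as follows. It assumes that `elbe control list_projects` exits non-zero when the daemon cannot be reached and that `elbe initvm stop` followed by `elbe initvm ensure` restarts the initvm. The function name `ensure_daemon_or_reboot` and the default retry values are made up for this example.

```shell
# Sketch: probe the daemon a few times, reboot the initvm if it stays unreachable.
# Assumption: "elbe control list_projects" fails when the daemon is not reachable.
ensure_daemon_or_reboot() {
   local attempts="${1:-3}"   # number of probes before giving up
   local delay="${2:-10}"     # seconds between probes
   local n=1

   while [ "$n" -le "$attempts" ]; do
      if elbe control list_projects > /dev/null 2>&1; then
         return 0
      fi
      echo "Probe $n/$attempts failed." >&2
      if [ "$n" -lt "$attempts" ]; then
         sleep "$delay"
      fi
      n=$((n + 1))
   done

   echo 'Daemon unreachable. Rebooting initvm...' >&2
   elbe initvm stop
   elbe initvm ensure
}

# Usage, for example as the first step of a CI job:
#   ensure_daemon_or_reboot 3 10
```

Probing more than once avoids rebooting the initvm because of a single transient connection error.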

jogness commented 2 years ago

> > If the elbe-daemon dies, it would probably be better to have the system reboot.

> That's what I do manually. Would you trigger a reboot once a connection attempt to the initvm fails?

I am not necessarily suggesting that the system reboots whenever elbe-daemon dies. But I have never seen a problem where starting a dead elbe-daemon actually solved the problem. Usually if elbe-daemon dies, the only thing that works is rebooting the initvm.

I have no idea why rebooting helps. Maybe there are some temp files lying around that confuse elbe-daemon.

IMHO elbe-daemon should be fixed, so that it doesn't die. ;-)

koberbe-jh commented 2 years ago

> > > If the elbe-daemon dies, it would probably be better to have the system reboot.

> > That's what I do manually. Would you trigger a reboot once a connection attempt to the initvm fails?

> I am not necessarily suggesting that the system reboots whenever elbe-daemon dies. But I have never seen a problem where starting a dead elbe-daemon actually solved the problem. Usually if elbe-daemon dies, the only thing that works is rebooting the initvm.

> I have no idea why rebooting helps. Maybe there are some temp files lying around that confuse elbe-daemon.

> IMHO elbe-daemon should be fixed, so that it doesn't die. ;-)

I agree that rebooting is a stronger reset that should resolve more issues. Running the ELBE initvm in CI, I have encountered different kinds of issues.

When I start and stop the initvm for every image build, I sometimes see problems with the virtual serial interface that the elbe daemon is connected to. I get errors similar to those described here, which make the elbe daemon unreachable: https://unix.stackexchange.com/questions/387600/understanding-serial8250-too-much-work-for-irq4-kernel-message

As this issue seems to have been around for years, I decided to keep the initvm running all the time. That did not turn out to be a stable solution either, because the elbe daemon died.

I currently have two identical initvms. I will run one with and one without this patch and keep you updated.

koberbe-jh commented 1 year ago

Feedback

> > > If the elbe-daemon dies, it would probably be better to have the system reboot.

> > That's what I do manually. Would you trigger a reboot once a connection attempt to the initvm fails?

> I am not necessarily suggesting that the system reboots whenever elbe-daemon dies. But I have never seen a problem where starting a dead elbe-daemon actually solved the problem. Usually if elbe-daemon dies, the only thing that works is rebooting the initvm.

> I have no idea why rebooting helps. Maybe there are some temp files lying around that confuse elbe-daemon.

> IMHO elbe-daemon should be fixed, so that it doesn't die. ;-)

Hello again. Over the past months I have gained more experience with ELBE. I can confirm that restarting the initvm is a better cure for many problems than restarting the daemon. This is especially the case when project builds get interrupted and the projects remain in the 'busy' state. Sometimes not even resetting and deleting the projects works. After a reboot of the initvm, the issues vanish. I decided to implement a fallback mechanism that tries to reset and delete busy projects. If that fails, the initvm is rebooted:

#!/bin/bash

set -e

# Find all busy ELBE projects (the project directory is the first tab-separated column)
BUSY_ELBE_PROJECTS=($(elbe control list_projects | grep -E '/var/cache/elbe/.*busy' | awk -F"\t" '{print $1}' | grep -E '/var/cache/elbe/' || true))

# Find all ELBE projects
ELBE_PROJECTS=($(elbe control list_projects | awk -F"\t" '{print $1}' | grep -E '/var/cache/elbe/' || true))

# Report missing projects (nothing can be deleted)
if [ ${#ELBE_PROJECTS[@]} -eq 0 ]
then
   echo 'No projects found. Exiting...'
   exit 0
fi

HAS_RESET_PROJECTS=false
for i in "${BUSY_ELBE_PROJECTS[@]}"
do
   set +e
   # Reset project to enable deletion (busy projects can otherwise not be removed)
   echo "Resetting $i..."
   elbe control reset_project "$i"
   if [ $? -eq 0 ] ; then
      echo "Reset $i."
   else
      echo 'First reset attempt failed. Retrying...'
      elbe control reset_project "$i"
      if [ $? -eq 0 ] ; then
         echo "Reset $i."
      else
         echo 'Resetting failed.'
      fi
   fi
   set -e

   HAS_RESET_PROJECTS=true
done

for i in "${ELBE_PROJECTS[@]}"
do
   set +e
   echo "Deleting $i..."
   elbe control del_project "$i"
   if [ $? -eq 0 ] ; then
      echo "Deleted $i."
   else
      echo 'First deletion attempt failed. Rebooting initvm to bring it back to a defined state...'
      elbe initvm stop
      elbe initvm ensure
      sleep 60

      # Retry the deletion once after the reboot
      echo "Deleting $i..."
      elbe control del_project "$i"
      if [ $? -eq 0 ] ; then
         echo "Deleted $i."
      else
         echo "Deletion failed."
      fi
   fi
   set -e
done

if [ "$HAS_RESET_PROJECTS" == "true" ]; then
   echo "Projects were reset. Rebooting initvm to ensure a defined state."
   elbe initvm stop
   elbe initvm ensure
   sleep 60
fi
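
The fixed `sleep 60` after each reboot in the script above could be replaced by a bounded wait that returns as soon as the daemon answers. A minimal sketch, assuming that `elbe control list_projects` fails while the daemon is not yet reachable; the function name `wait_for_daemon` and the default values are made up for this example. Depending on the ELBE version, `elbe initvm ensure` may already wait for the daemon itself, in which case this only makes the wait explicit.

```shell
# Sketch: poll the daemon until it answers or a time budget is used up.
# Assumption: "elbe control list_projects" fails while the daemon is unreachable.
wait_for_daemon() {
   local timeout="${1:-120}"   # overall budget in seconds
   local interval="${2:-5}"    # pause between probes in seconds
   local waited=0

   while ! elbe control list_projects > /dev/null 2>&1; do
      if [ "$waited" -ge "$timeout" ]; then
         echo "Daemon not reachable after ${waited}s." >&2
         return 1
      fi
      sleep "$interval"
      waited=$((waited + interval))
   done
}

# Usage in place of "sleep 60":
#   elbe initvm stop
#   elbe initvm ensure
#   wait_for_daemon 120 5 || exit 1
```

Compared with a fixed sleep, this returns early when the daemon comes up quickly and reports an error when it does not come up at all.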

I don't think this is an ideal situation, as I am fighting the symptoms instead of solving the root cause. If someone has ideas, please comment. This PR can be closed because I couldn't prove any benefit from the change.