Closed koberbe-jh closed 2 years ago
Can you please provide a log of the error?
If the elbe-daemon dies, it would probably be better to have the system reboot.
> Can you please provide a log of the error?
I'm sorry, but log rotation already removed it. I will post it once I see the errors again.
As far as I remember, the process was killed because the initvm ran out of memory, so I increased the memory of the initvm. The 'restart on failure' is meant to be an additional mitigation strategy.
> If the elbe-daemon dies, it would probably be better to have the system reboot.
That's what I do manually. Would you trigger a reboot once a connection attempt to the initvm fails?
I am not necessarily suggesting that the system reboots whenever elbe-daemon dies. But I have never seen a problem where starting a dead elbe-daemon actually solved the problem. Usually, if elbe-daemon dies, the only thing that works is rebooting the initvm.
I have no idea why rebooting helps. Maybe there are some temp files lying around that confuse elbe-daemon.
IMHO elbe-daemon should be fixed, so that it doesn't die. ;-)
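The "reboot once a connection attempt fails" idea discussed above could be sketched roughly as follows. This is only an illustration, not part of ELBE: the function name and the `CHECK_CMD`, `STOP_CMD` and `START_CMD` variables are hypothetical, and it assumes that `elbe control list_projects` fails (or hangs, hence the `timeout`) when the daemon is unreachable.

```shell
#!/bin/bash
# Hypothetical watchdog sketch: reboot the initvm when the daemon stops answering.
# The three commands are overridable so the logic can be tried without ELBE.
CHECK_CMD=${CHECK_CMD:-"timeout 30 elbe control list_projects"}
STOP_CMD=${STOP_CMD:-"elbe initvm stop"}
START_CMD=${START_CMD:-"elbe initvm ensure"}

# Returns 0 if the daemon answered, 1 if a reboot of the initvm was attempted.
check_or_reboot_initvm() {
    if $CHECK_CMD >/dev/null 2>&1; then
        echo "elbe daemon reachable."
        return 0
    fi
    echo "elbe daemon unreachable. Rebooting initvm..."
    $STOP_CMD || echo "Stopping the initvm failed; trying to start it anyway."
    $START_CMD
    return 1
}
```

A CI job could call `check_or_reboot_initvm` before each build. Whether a reboot should really be automatic is exactly the open question in this thread, so treat this as a starting point only.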
I agree that rebooting is a stronger reset that should resolve more issues. While running the ELBE initvm in CI, I have encountered different kinds of issues.
When I start and stop the initvm for every image build, I sometimes see problems with the virtual serial interface that the elbe daemon is connected to. I get errors similar to those described here, which make the elbe daemon unreachable: https://unix.stackexchange.com/questions/387600/understanding-serial8250-too-much-work-for-irq4-kernel-message As this issue seems to have been around for years, I decided to keep the initvm running all the time. However, once the elbe daemon died, that did not prove to be a stable solution either.
I currently have two identical initvms. I will run one with and one without this patch and keep you updated.
Feedback
Hello again, in the past months I have gained some more experience with ELBE. I can confirm that restarting the initvm is a better cure for many problems than restarting the daemon. This is especially the case when project builds got interrupted and the projects remain in the 'busy' state. Sometimes not even resetting and deleting the projects works. After a reboot of the initvm, the issues vanish. I decided to implement a fallback mechanism that tries to reset and delete busy projects. If that fails, the initvm is rebooted:
#!/bin/bash
set -e

# Find all busy ELBE projects. The trailing "|| true" only keeps a non-matching
# grep from failing the command substitution; it does not end the script.
BUSY_ELBE_PROJECTS=($(elbe control list_projects | grep -E '/var/cache/elbe/.*busy' | awk -F"\t" '{print $1}' | grep -E '/var/cache/elbe/' || true))

# Find all ELBE projects
ELBE_PROJECTS=($(elbe control list_projects | awk -F"\t" '{print $1}' | grep -E '/var/cache/elbe/' || true))

# Report missing projects (nothing can be deleted)
if [ ${#ELBE_PROJECTS[@]} -eq 0 ]
then
    echo 'No projects found. Exiting...'
    exit 0
fi

HAS_RESET_PROJECTS=false
for i in "${BUSY_ELBE_PROJECTS[@]}"
do
    # Reset project to enable deletion (busy projects can otherwise not be removed)
    echo "Resetting $i..."
    if elbe control reset_project "$i" ; then
        echo "Reset $i."
    else
        echo 'First reset attempt failed. Retrying...'
        if elbe control reset_project "$i" ; then
            echo "Reset $i."
        else
            echo 'Resetting failed.'
        fi
    fi
    HAS_RESET_PROJECTS=true
done

for i in "${ELBE_PROJECTS[@]}"
do
    # Errors are handled explicitly below, so do not abort on failure here
    set +e
    echo "Deleting $i..."
    if elbe control del_project "$i" ; then
        echo "Deleted $i."
    else
        echo 'First deletion attempt failed. Rebooting initvm to bring it back to a defined state...'
        elbe initvm stop
        elbe initvm ensure
        sleep 60
        echo "Deleting $i..."
        if elbe control del_project "$i" ; then
            echo "Deleted $i."
        else
            echo "Deletion failed."
        fi
    fi
    set -e
done

if [ "$HAS_RESET_PROJECTS" == "true" ]; then
    echo "Projects were reset. Rebooting initvm to ensure a defined state."
    elbe initvm stop
    elbe initvm ensure
    sleep 60
fi
I don't think this is an ideal situation, as I'm fighting the symptoms instead of solving the root cause. If someone has ideas, please comment. This PR can be closed because I couldn't prove any benefit from the change.
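One possible refinement of the script above would be to replace the fixed `sleep 60` after a reboot with a polling loop. The following is a minimal sketch under assumptions: `wait_for_elbe_daemon` and `PROBE_CMD` are hypothetical names, and it assumes that a successful `elbe control list_projects` means the daemon is ready. Whether `elbe initvm ensure` already waits for the daemon has not been verified here.

```shell
#!/bin/bash
# Hypothetical readiness probe; overridable so the loop can be tried without ELBE.
PROBE_CMD=${PROBE_CMD:-"elbe control list_projects"}

# Poll until the probe succeeds or the timeout is reached.
# Usage: wait_for_elbe_daemon [timeout_seconds] [interval_seconds]
wait_for_elbe_daemon() {
    local timeout=${1:-120} interval=${2:-5} waited=0
    until $PROBE_CMD >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "elbe daemon not reachable after ${waited}s."
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "elbe daemon reachable after ${waited}s."
}
```

Calling `wait_for_elbe_daemon 120 5` right after `elbe initvm ensure` would continue as soon as the daemon answers and fail visibly if it never does.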
The ELBE daemon service might crash due to internal errors or resource shortages. At the moment, the service does not recover from this state. This change makes the service restart when its execution fails. This is especially helpful when running the ELBE initvm in automated environments such as CI/CD infrastructures.
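For readers who want to try the same behaviour without patching the package, a restart-on-failure policy can be added with a systemd drop-in. The sketch below is not the diff of this PR: the function name, the drop-in file name `restart.conf` and the `RestartSec=5` delay are illustrative choices, and the unit name has to be looked up inside the initvm.

```shell
#!/bin/bash
# Write a systemd drop-in that restarts a unit when it exits with a failure.
# Usage: write_restart_dropin <unit> [base_dir]
# base_dir defaults to /etc/systemd/system; pass another directory for a dry run.
write_restart_dropin() {
    local unit=$1 base_dir=${2:-/etc/systemd/system}
    local dropin_dir="$base_dir/$unit.d"
    mkdir -p "$dropin_dir"
    # Restart=on-failure restarts after a non-zero exit, a signal or a timeout.
    printf '[Service]\nRestart=on-failure\nRestartSec=5\n' > "$dropin_dir/restart.conf"
    echo "$dropin_dir/restart.conf"
}
```

After writing the drop-in for the daemon's unit, `systemctl daemon-reload` makes systemd pick it up. As the discussion above shows, restarting the daemon alone may not be enough in practice.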