FooDeas / raspberrypi-ua-netinst

RaspberryPi (minimal) unattended netinstaller
ISC License

Improve error handling #83

Closed dinosore closed 7 years ago

dinosore commented 7 years ago

I have recently run the latest version (1.5.2) of netinst successfully. However, you can see I had some problems on the way.

root@pie:/boot/raspberrypi-ua-netinst# ls
config                     error-20170709T174244.log  error-20170709T232709.log  error-20170710T010543.log
error-20170708T212959.log  error-20170709T174650.log  error-20170709T233024.log  error-20170710T011106.log
error-20170709T071402.log  error-20170709T175106.log  error-20170709T233420.log  error-20170710T011634.log
error-20170709T073550.log  error-20170709T184551.log  error-20170709T233907.log  error-20170710T012025.log
error-20170709T093019.log  error-20170709T184653.log  error-20170709T234627.log  error-20170710T012740.log
error-20170709T103913.log  error-20170709T184757.log  error-20170709T235104.log  error-20170710T013233.log
error-20170709T104304.log  error-20170709T184900.log  error-20170709T235526.log  error-20170710T013958.log
error-20170709T104801.log  error-20170709T185004.log  error-20170710T000200.log  error-20170710T014501.log
error-20170709T105212.log  error-20170709T220354.log  error-20170710T000647.log  error-20170710T014933.log
error-20170709T105913.log  error-20170709T220907.log  error-20170710T001616.log  error-20170710T015440.log
error-20170709T110415.log  error-20170709T221455.log  error-20170710T002154.log  error-20170710T020054.log
error-20170709T110852.log  error-20170709T221826.log  error-20170710T002702.log  error-20170710T020638.log
error-20170709T111248.log  error-20170709T222505.log  error-20170710T003357.log  error-20170710T021611.log
error-20170709T111730.log  error-20170709T225335.log  error-20170710T003829.log  raspberrypi-ua-netinst.cpio.gz
error-20170709T112359.log  error-20170709T230147.log  error-20170710T004554.log  reinstall
error-20170709T155143.log  error-20170709T230817.log  error-20170710T005102.log
error-20170709T155802.log  error-20170709T231314.log  error-20170710T005614.log
error-20170709T160340.log  error-20170709T232212.log  error-20170710T010117.log
root@pie:/boot/raspberrypi-ua-netinst# tail error-20170710T021611.log 
  P: Validating diffutils
  P: Retrieving dosfstools
  P: Validating dosfstools
  P: Retrieving e2fslibs
  wget: can't connect to remote host (155.232.191.250): Connection timed out
  E: Couldn't download pool/main/e/e2fsprogs/e2fslibs_1.42.12-2_armhf.deb!

  ERROR: 1

Error: The installation could not be completed!
root@pie:/boot/raspberrypi-ua-netinst# head error-20170710T021611.log 

==================================================
raspberrypi-ua-netinst
==================================================
Revision v1.5.2 (fa8c9a8)
Built on Fri Jun 16 00:16:39 CEST 2017
Running on Raspberry Pi version Zero W
==================================================
https://github.com/FooDeas/raspberrypi-ua-netinst/
==================================================
root@pie:/boot/raspberrypi-ua-netinst#

All the attempts listed failed in the same way - timeout downloading packages for the basic system.

On Saturday I set installer_retries to 1 and let it run and went to bed early. I hadn't come across this problem before, didn't see the need for automatic retries and wanted to check each attempt separately.

On Sunday I retried quite a few times, switching from mirrordirector to a few different UK mirrors. No joy.

I was considering setting up a mirror cache, but instead picked a mirror in South Africa, set retries to 100 and went to bed.

Monday morning the green led was out and the install had completed successfully.

So what might help make things go more smoothly?

1. Preserve cache of qownloaded packages.

I notice the packages from the last attempt are in /var/cache. Perhaps if the root partition was present, the necessary folder was in /var/cache, the list of packages hadn't changed or was sufficiently recent (say 24 hours), it wouldn't be necessary to re-download the packages.

2. Do final_action on failure when there will be no further retries.

I use the poweroff option, and I would like it to apply whether the installation completes successfully or not. I referenced this in #72.

I know you don't agree, but it would be ever so good not to have to guess when I need to take out the sd card from the pi. Perhaps have a new option error_final_action?

3. Be more patient when installing over wifi.

Longer timeout and/or retry at the package level, maybe after a delay.

4. Be more patient anyway.

I'm not sure if wifi is relevant or not.
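Points 3 and 4 amount to retrying a failed download after a pause instead of aborting the whole run. A minimal sketch of such a wrapper, assuming a POSIX shell; `retry_with_delay` and its arguments are hypothetical names and are not taken from the installer:

```shell
#!/bin/sh
# Hypothetical helper (not part of the installer):
# run a command up to $1 times, sleeping $2 seconds between attempts.
retry_with_delay() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt budget is used.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, `retry_with_delay 5 30 wget -T 60 -O /tmp/pkg.deb "$url"` would try one download up to five times, 30 seconds apart. The numbers, the path and `$url` are placeholders.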

edit: I've noticed the typo in 1,Preserve... - I've left it as I rather like it.

FooDeas commented 7 years ago

What do the Pi Zero W LEDs do when an error has occurred and no retries are left? (You should see an SOS...)

dinosore commented 7 years ago

There's only one led, a green activity led.

I didn't know to expect SOS until I looked at the code today.

I don't think led_sos has been called: there is no mention of retries in the logs.

Just to make sure:

root@pie:/boot/raspberrypi-ua-netinst# cat *.log|grep "The maximum number of retries is reached"
root@pie:/boot/raspberrypi-ua-netinst# cat *.log|grep "retries left"
root@pie:/boot/raspberrypi-ua-netinst# 

FooDeas commented 7 years ago

  1. Will be done.
  2. If an error occurs and no retries are left, the installer stops as requested in #72 and signals SOS via the LED.
  3. Retries within the same installer call will be implemented.
  4. Not possible, but complete connection retries should have the same effect.

dinosore commented 7 years ago

1. Preserve

I'm interested to see how you'll do this. I was thinking along the lines of

This assumes that cdebootstrap will make use of anything already in /var/cache

On the other hand, you may be thinking of something completely different.
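As a rough illustration of the freshness condition proposed in the original report (reuse the cache only if it exists and is under 24 hours old), a minimal sketch follows. The function name, the directory argument and the default threshold are assumptions, and nothing here comes from the installer's code. It covers only the age test, not the "package list unchanged" test, and it says nothing about whether cdebootstrap would actually reuse the files.

```shell
#!/bin/sh
# Hypothetical check (illustrative only): succeed if directory $1 exists and
# holds at least one .deb file modified within the last $2 minutes
# (default 1440 minutes, i.e. 24 hours).
cache_is_fresh() {
    cache_dir="$1"
    max_age_min="${2:-1440}"
    [ -d "$cache_dir" ] || return 1
    # -mmin -N matches files modified less than N minutes ago.
    recent=$(find "$cache_dir" -type f -name '*.deb' -mmin "-$max_age_min" | head -n 1)
    [ -n "$recent" ]
}
```

A caller could then skip the download step when `cache_is_fresh /path/to/cache` succeeds and fall back to a full download otherwise.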

2. Final action on failure

I borrowed your led_sos function and ran it on its own. It ran well on the Pi0W with a nice big green activity led flashing a clear SOS pattern. I'm impressed.

It wasn't so obvious on the Pi1B as the green led is much smaller and next to a big red one (power I think). Only one led sos'd. Still clear enough if peered at.

I didn't recognise the SOS at the time - perhaps I needed a big pointer to look out for it.

I convinced myself that it hadn't worked as there was nothing in the log about retries.

Then I noticed that messages about retries came after the copy of the log to the boot partition!
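The ordering problem can be shown with a small sketch: append the retry status to the log first, and only then copy the log away. The function name and its arguments are hypothetical. The two messages are the ones grepped for earlier in this thread, and the file name follows the error-*.log pattern in the listing above.

```shell
#!/bin/sh
# Hypothetical sketch (not the installer's code): write the retry status
# into the log BEFORE the log is copied to the boot partition.
# $1 = log file, $2 = destination directory, $3 = retries left
save_log_with_status() {
    logfile="$1"
    dest_dir="$2"
    retries_left="$3"
    if [ "$retries_left" -gt 0 ]; then
        echo "$retries_left retries left" >> "$logfile"
    else
        echo "The maximum number of retries is reached" >> "$logfile"
    fi
    # Copy last, so the saved file includes the status line above.
    cp "$logfile" "$dest_dir/error-$(date +%Y%m%dT%H%M%S).log"
}
```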

FooDeas commented 7 years ago

  1. Be patient. :)
  2. I'll write a bit more in the logs before the unmount.

On the LED topic: which LEDs, and how many, indicate the no-retries-left state depends on the board's schematics. As far as I know, only the model 3 has both LEDs controlled by the MCU.

dinosore commented 7 years ago

I have created pull request #85 to correct the flashing on Zero and Zero W.

This explains why I didn't recognise SOS on the Zero W - the led was on when it should have been off and vice-versa.

Thus, during the pause between SOS sequences, the led was on.

Do you want another issue for this bug?

FooDeas commented 7 years ago

You're right! The Zeros have no MOSFET in the LED circuit. I'll have a look at the code very soon!
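The inversion discussed above can be handled by separating the logical LED state from the value written to the LED. The sketch below is only an illustration, not the fix from pull request #85. The sysfs path, the `LED_ACTIVE_LOW` switch, the timing and all function names are assumptions, and the gaps between letters are simplified to a single unit.

```shell
#!/bin/sh
# Hypothetical sketch; names and defaults below are illustrative assumptions.
LED_PATH="${LED_PATH:-/sys/class/leds/led0/brightness}"
LED_ACTIVE_LOW="${LED_ACTIVE_LOW:-0}"  # set to 1 where writing 0 lights the LED
UNIT="${UNIT:-0.2}"                    # seconds per Morse unit

# Drive the LED to a logical state (1 = lit, 0 = dark), inverting if needed.
led_set() {
    if [ "$LED_ACTIVE_LOW" = "1" ]; then
        echo $((1 - $1)) > "$LED_PATH"
    else
        echo "$1" > "$LED_PATH"
    fi
}

# Sleep for $1 Morse units.
wait_units() {
    i=0
    while [ "$i" -lt "$1" ]; do
        sleep "$UNIT"
        i=$((i + 1))
    done
}

# Light the LED for $1 units, then keep it dark for one unit.
flash() {
    led_set 1
    wait_units "$1"
    led_set 0
    wait_units 1
}

# One SOS: three dots, three dashes, three dots, then a longer dark pause.
sos() {
    for len in 1 1 1 3 3 3 1 1 1; do
        flash "$len"
    done
    wait_units 6
}
```

With `LED_ACTIVE_LOW=1`, a call to `sos` leaves the LED dark during the pause between sequences, which is the behaviour that was missing on the Zero W.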