When using the install_image action while ZTP'ing a brand new switch, I'm getting OOM errors on Arista 7010 OOB switches.
This started happening around EOS version 4.26.3M, but it may also affect earlier versions; I didn't try them all.
My guess is that this is caused by the incremental growth of the EOS image size with each release, combined with the fact that the ZTPserver install_image action uses a plain GET request to download the given EOS image. As far as I know, by default the Python requests library loads the whole response into memory before it is written to disk, and because the process is memory-restricted, it gets OOM-killed.
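For illustration, the failure mode I suspect looks roughly like this (a sketch only; I haven't traced the exact code path in the action, and the URL and file path here are made up):

import requests

# Plain, non-streaming GET: requests buffers the entire response body in
# memory, so writing out a multi-hundred-MB EOS image needs at least that
# much RAM on top of everything else running on the switch.
response = requests.get('http://ztpserver:8080/images/EOS-4.26.3M.swi')  # hypothetical URL
with open('/mnt/flash/EOS-4.26.3M.swi', 'wb') as swi:                    # hypothetical path
    swi.write(response.content)  # the whole image is already held in memory here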
See example error output below:
[admin@localhost flash]$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       2.6G       1.2G         0B        50M       806M
-/+ buffers/cache:        1.8G       2.0G
Swap:           0B         0B         0B
[admin@localhost flash]$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       2.6G       1.2G         0B        50M       806M
-/+ buffers/cache:        1.8G       2.0G
Swap:           0B         0B         0B
[admin@localhost flash]$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       2.6G       1.2G         0B        50M       806M
-/+ buffers/cache:        1.8G       2.0G
Swap:           0B         0B         0B
[admin@localhost flash]$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       2.6G       1.2G         0B        50M       806M
-/+ buffers/cache:        1.8G       2.0G
Swap:           0B         0B         0B
[admin@localhost flash]$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       2.7G       1.1G         0B        50M       806M
-/+ buffers/cache:        1.8G       2.0G
Swap:           0B         0B         0B
[admin@localhost flash]$ free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       2.7G       1.1G         0B        50M       806M
-/+ buffers/cache:        1.8G       2.0G
Swap:           0B         0B         0B
[admin@localhost flash]$
Jan 5 14:11:19 localhost kernel: [81002.319709] Out of memory: Kill process 21008 (ReloadCauseAgen) score 919 or sacrifice child
Jan 5 14:11:19 localhost kernel: [81002.546948] Killed process 21008 (ReloadCauseAgen) total-vm:454076kB, anon-rss:18704kB, file-rss:62344kB, shmem-rss:24kB
Jan 5 14:11:19 localhost kernel: [81002.546948] memory usage:1.9% score:917130 oom_score_adj:900
Jan 5 14:11:20 localhost kernel: [81003.194389] Out of memory: Kill process 21162 (python) score 526 or sacrifice child
Jan 5 14:11:20 localhost kernel: [81003.289228] Killed process 21162 (python) total-vm:2193648kB, anon-rss:2152380kB, file-rss:8368kB, shmem-rss:0kB
Jan 5 14:11:20 localhost kernel: [81003.289228] memory usage:52.6% score:525019 oom_score_adj:0
Jan 5 14:11:20 localhost ZeroTouch: %ZTP-4-EXEC_SCRIPT_SIGNALED: Config script exited with an uncaught signal. Signal code: Killed
Jan 5 14:11:20 localhost ZeroTouch: %ZTP-6-RETRY: Retrying Zero Touch Provisioning from the beginning (attempt 2423)
A solution would be to write the image to disk in chunks using requests' streaming support, instead of downloading the whole image into memory first.
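Something along these lines should keep memory usage bounded by the chunk size (a rough, untested sketch; the URL, path, and chunk size are placeholders, and the real change would go wherever the install_image action fetches the image):

import requests

# Stream the download and write it to flash chunk by chunk, so peak memory
# use is roughly one chunk instead of the whole image.
url = 'http://ztpserver:8080/images/EOS-4.26.3M.swi'   # hypothetical URL
destination = '/mnt/flash/EOS-4.26.3M.swi'             # hypothetical path

response = requests.get(url, stream=True)
response.raise_for_status()
with open(destination, 'wb') as swi:
    for chunk in response.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
        if chunk:  # skip keep-alive chunks
            swi.write(chunk)
response.close()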