catchpoint / WebPageTest.agent

Cross-platform WebPageTest agent

Bootstrapping a new agent - apt is locked #378

Open mosheavni opened 4 years ago

mosheavni commented 4 years ago

Hi, when launching agents with the AWS AMIs, the agent starts bootstrapping at the same time as the apt daily updates run. This leaves new agents caught in a loop of:

E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

This continues until the daily updates are finished. Sometimes it causes new agents to take up to 30 minutes to start getting work! Is there any chance of implementing this fix in the agents' bootstrap script? https://unix.stackexchange.com/a/315517/150289

#!/bin/bash

systemctl stop apt-daily.service
systemctl kill --kill-who=all apt-daily.service

# Wait until `apt-get update` has been killed
while ! (systemctl list-units --all apt-daily.service | grep -Eq '(dead|failed)')
do
  sleep 1
done

# Do the agent's bootstrapping

pmeenan commented 4 years ago

The agent is just waiting for the daily service to finish updating before it tries to update as well. Disabling the daily service would just move the slow part to be the agent's update.

The main issue is that the AMIs are fairly old at this point, so the auto-update when they first spin up (to make sure they have all the latest security fixes and browsers) can take a really long time.
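To illustrate what "waiting for the daily service" amounts to, here is a minimal sketch of polling the dpkg frontend lock before running apt. It is not the agent's actual code: `wait_for_dpkg_lock` is a hypothetical helper, the 30-minute default timeout is an assumption, and it relies on `fuser` (from psmisc) being installed and run with enough privileges to see the process holding the lock.

```shell
#!/bin/bash
# Sketch only: wait_for_dpkg_lock is a hypothetical helper, not part of the agent.
# Polls until no process has the lock file open, or until the timeout expires.
wait_for_dpkg_lock() {
  local lock_file="${1:-/var/lib/dpkg/lock-frontend}"
  local timeout_secs="${2:-1800}"   # assumed upper bound (30 minutes)
  local waited=0

  # fuser exits 0 while at least one visible process has the file open.
  # Run as root, otherwise processes owned by other users are not seen.
  while fuser "$lock_file" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_secs" ]; then
      echo "Timed out after ${waited}s waiting for $lock_file" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example use (the apt command is only an illustration):
#   wait_for_dpkg_lock && apt-get update
```

This only avoids the error loop in the logs; it does not make the first boot any faster, since the updates still have to run.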

mosheavni commented 4 years ago

I understand. I can help with creating a Packer manifest for the AMIs, but for that I need to know what's included in the AMIs. @pmeenan, can you point me in the right direction? Thanks.

mosheavni commented 3 years ago

@pmeenan Please let me know how I can build the image with the latest apt updates. We really need the agents to spin up faster. Thanks!

pmeenan commented 3 years ago

abarre commented 3 years ago

For your information, we discovered that we have the exact same issue. A lot of our agents are terminated after the IdleTerminateMinutes (set to 10 minutes) delay without taking any tasks.

When I look at the agent, I can see that it is blocked in the initialization phase, with this error repeating in a loop:

E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

mosheavni commented 3 years ago

> For your information, we discovered that we have the exact same issue. A lot of our agents are terminated after the IdleTerminateMinutes (set to 10 minutes) delay without taking any tasks.
>
> When I look at the agent, I can see that it is blocked in the initialization phase, with this error repeating in a loop:
>
> E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
> E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

So how did you overcome this?

abarre commented 3 years ago

We created an AMI from an agent after the init phase as suggested by @pmeenan. This works but it is not ideal.
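For anyone wanting to do the same, a minimal sketch of that step with the AWS CLI might look like the following. The helper name, instance ID, image name, and description are placeholders rather than anything from the agent; it assumes the AWS CLI is installed and configured with permission to create images.

```shell
#!/bin/bash
# Sketch only: create_agent_ami is a hypothetical helper; the instance ID and
# image name are placeholders to be replaced with your own values.
create_agent_ami() {
  local instance_id="$1"
  local image_name="${2:-wptagent-$(date +%Y%m%d)}"

  # Snapshot an agent that has already finished its first-boot updates.
  # By default create-image reboots the instance to get a consistent image.
  aws ec2 create-image \
    --instance-id "$instance_id" \
    --name "$image_name" \
    --description "WebPageTest agent captured after the init phase" \
    --query 'ImageId' --output text
}

# Example use (placeholder instance ID):
#   create_agent_ami i-0123456789abcdef0
```

The command prints the new image ID, which is what then has to be referenced from ec2_locations.ini.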

This approach creates another problem, though. Since we changed the ec2_locations.ini file on the server to reference the new AMI, the server auto-update no longer works. In the file cron/hourly.php, the command git pull origin release fails with the following error:

Cannot pull with rebase: You have unstaged changes.
Please commit or stash them.

pmeenan commented 3 years ago

If you turn off "ec2_locations" in settings.ini, it will only read from locations.ini instead of merging in ec2_locations.ini. Copying your customized ec2_locations.ini to locations.ini should work.
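As a rough sketch of those two steps: `use_static_ec2_locations` below is a hypothetical helper, the settings directory is whatever path your server install uses, and "turning off" the setting is assumed here to mean commenting out the `ec2_locations` line with `;`. Check your own settings.ini before relying on it.

```shell
#!/bin/bash
# Sketch only: use_static_ec2_locations is a hypothetical helper, and the
# settings directory passed in is an assumption about the local install.
use_static_ec2_locations() {
  local settings_dir="$1"

  # Keep a backup of the current locations.ini, if there is one.
  if [ -f "$settings_dir/locations.ini" ]; then
    cp "$settings_dir/locations.ini" "$settings_dir/locations.ini.orig" || return 1
  fi

  # Reuse the customized EC2 location list as the plain locations file.
  cp "$settings_dir/ec2_locations.ini" "$settings_dir/locations.ini" || return 1

  # Comment out the ec2_locations setting (assumed way of turning it off).
  # sed keeps the previous file as settings.ini.bak.
  sed -i.bak 's/^[[:space:]]*ec2_locations[[:space:]]*=/;&/' \
    "$settings_dir/settings.ini"
}

# Example use (path is a placeholder):
#   use_static_ec2_locations /path/to/www/settings
```

The tracked ec2_locations.ini presumably still has to be restored to its committed state afterwards for git pull to stop complaining about unstaged changes.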