Avoid having all of the edxapp servers broken at the same time during a deploy.
Previous approach:
- The `configure app server instance(s)` play runs on all edxapp hosts in parallel.
- `pip install` and nodejs/webpack tasks that consume a lot of CPU run on all servers at the same time, slowing down request handling for 10-20 minutes.
- If a django process happens to restart while that `pip install` step is running, it can run into failures due to the incomplete/inconsistent environment.
- The cms and lms services are restarted at the same time on all hosts. That produces a window of about 30s to 1 minute, while the django processes load, during which no server can serve any requests.
This approach:
- It goes through the edxapp servers one at a time (the `serial: 1` line in the playbook).
- It stops nginx on the server, which causes the load balancer healthcheck to fail, so the instance is removed from the load balancer pool and stops receiving traffic.
- It runs the other deploy steps as normal, but only on that one server. Since it isn't receiving any traffic, the high CPU load and potentially broken python environment aren't a problem.
- When it's done, it starts nginx back up, which adds the server back to the load balancer pool, and it starts getting traffic again.
- Then it moves on to the next server and repeats the same process there.
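The steps above can be sketched as a play along these lines (a hypothetical sketch; the task names, role name, and host group are illustrative, not the actual playbook):

```yaml
# Rolling deploy sketch: serial: 1 makes Ansible finish all tasks on
# one host before starting the next.
- name: Deploy edxapp one server at a time
  hosts: edxapp_servers
  serial: 1
  tasks:
    - name: Take the host out of the load balancer pool
      service:
        name: nginx
        state: stopped   # LB healthcheck fails -> host stops getting traffic

    - name: Run the normal deploy steps (pip install, webpack, etc.)
      include_role:
        name: edxapp     # safe now: no requests hit this host

    - name: Restart the app processes
      service:
        name: "{{ item }}"
        state: restarted
      loop: [lms, cms]

    - name: Put the host back into the pool
      service:
        name: nginx
        state: started   # healthcheck passes -> traffic returns
```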
Since we have two edxapp servers in Tahoe, there should always be one server handling traffic at any point during the deploy.
The big downside to this approach is that doing them one at a time instead of in parallel adds time to the deploy. Currently, that's about 10-15 minutes. The more servers we have, the more time it will add.
To minimize that, I pulled some of the roles that didn't need to run serially out into a separate play that runs in parallel. That helps, but there is still more happening serially than I'd like. The issue I ran into is that some of the roles depend on variables defined in other roles. E.g., `mysql_init` shouldn't have to run serially, but `edxapp` depends on the `mysql_client_cert_path` variable, which is defined by `mysql_init`, and if the two roles are not in the same play, they can't access each other's variables. Refactoring and improving those roles to eliminate the interdependencies will help us reduce the amount of work that has to be done serially.
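A minimal sketch of the constraint, using the role names above (the playbook structure here is illustrative, not the real one). Plain role variables are scoped to the play they were set in, so a role that defines a variable another role reads has to stay in the same play as its consumer:

```yaml
# Roles with no cross-role variable dependencies can run in one
# parallel pass over all hosts...
- name: Roles safe to run in parallel
  hosts: edxapp_servers
  roles:
    - common
    # mysql_init would also belong here, except that edxapp (below)
    # reads mysql_client_cert_path, which mysql_init defines.

# ...while the serial play must still include every role whose
# variables the edxapp role depends on.
- name: Serial deploy
  hosts: edxapp_servers
  serial: 1
  roles:
    - mysql_init   # kept here only so edxapp can see its variables
    - edxapp
```

One possible way out (an assumption about the refactor, not something the current roles do) is to promote such values to host facts with `set_fact`, since facts persist across plays; that would let `mysql_init` move into the parallel play.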