WhatsApp / WhatsApp-Business-API-Setup-Scripts

The scripts related to setting up WhatsApp business API
MIT License

waweb: database.yml with multiple web app instances #15

Closed · aogier closed this 3 years ago

aogier commented 4 years ago

Hi, when deploying more than one webapp container, the ones that do not create the db can fail, because of this db creation script invocation in /opt/whatsapp/bin/launch_within_docker.sh:

# create web db and store db config database.yml in DB_CFG_DIR
/usr/bin/php ${SCRIPTS_DIR}/CreateWebDB.php ${HTTP_ROOT} ${DB_CFG_DIR}

This invocation can exit with an error:

 oggei@rtfm ~  k logs example-whatsapp-web-deployment-69c6b9bfc8-5pltg                   
Failed to create web db or store db settings to database.yml file.
An exception occurred while executing 'CREATE DATABASE example_whatsapp_waweb':

SQLSTATE[HY000]: General error: 1007 Can't create database 'example_whatsapp_waweb'; database exists
#0 /var/www/html/vendor/doctrine/dbal/lib/Doctrine/DBAL/DBALException.php(2): Doctrine\DBAL\Driver\AbstractMySQLDriver->convertException('An exception oc...', Object(Doctrine\DBAL\Driver\PDOException))
#1 /var/www/html/vendor/doctrine/dbal/lib/Doctrine/DBAL/Connection.php(2): Doctrine\DBAL\DBALException::driverExceptionDuringQuery(Object(Doctrine\DBAL\Driver\PDOMySql\Driver), Object(Doctrine\DBAL\Driver\PDOException), 'CREATE DATABASE...', Array)
#2 /var/www/html/src/WhatsApp/Database/DBConnection.php(2): Doctrine\DBAL\Connection->executeUpdate('CREATE DATABASE...', Array, Array)
#3 /var/www/html/vendor/doctrine/dbal/lib/Doctrine/DBAL/Schema/AbstractSchemaManager.php(2): WhatsApp\Database\DBConnection->executeUpdate('CREATE DATABASE...')
#4 /var/www/html/vendor/doctrine/dbal/lib/Doctrine/DBAL/Schema/AbstractSchemaManager.php(2): Doctrine\DBAL\Schema\AbstractSchemaManager->_execSql('CREATE DATABASE...')
#5 /var/www/html/src/WhatsApp/Database/DBUtil.php(2): Doctrine\DBAL\Schema\AbstractSchemaManager->createDatabase('example_whatsap...')
#6 /var/www/html/src/WhatsApp/Database/DBUtil.php(2): WhatsApp\Database\DBUtil->createDatabase(Object(WhatsApp\Database\DBConnection), 'example_whatsap...')
#7 /var/www/html/src/WhatsApp/Scripts/CreateWebDB.php(2): WhatsApp\Database\DBUtil->createDBConnection(Array)
#8 {main}

I'm unsure why this does not happen every time, but we are experiencing it. Given the contents of the /var/lib/whatsapp/ directory (this config, plus the ssl certs), I suspect a sensible fix could be to share it as a RWX volume between the waweb containers. I'm trying that approach and it seems to work correctly; what do you think? I think the same problem could apply to both k8s and docker-compose. I understand that neither the diagrams nor the compose files use multiple webapps, but the k8s manifests do provide such a setup and it seems great. Is the multi-waweb option unavailable at the moment?

mengyiyuan commented 4 years ago

This is a very interesting finding. It looks like the web app container failed to start because it was trying to create a db that had already been created by a previous web app, not because it could not write to the database.yml file, if I read the error message correctly. As a result, I don't think making /var/lib/whatsapp/ a RWX volume would fix the issue, would it?

aogier commented 4 years ago

Hi Mengyi, thank you for your kind answer.

The situation as of now: if the container fails to create the db, the script bails out via a wide catch inside /var/www/html/src/WhatsApp/Scripts/CreateWebDB.php, leaving the container up and running in an unhealthy state, i.e. returning a 500 at runtime when you hit it with a request. Given that database.yml is only serialized by the waweb container that wins the database creation, I worked around this behaviour by sharing the path that file is written to amongst the waweb containers with a ReadWriteMany PVC (maybe this also helps because of the ssl certs inside that dir; I still have to verify that). What do you think? It's a simple solution that could immediately give us redundancy on the webapp!
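
For reference, a minimal sketch of the claim I mean, assuming your cluster has a storage class that supports ReadWriteMany (the names and sizes here are placeholders I made up, not values from this repo):

# Hypothetical shared claim; storageClassName and size are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: waweb-shared-config
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Gi
EOF
# Then mount the claim at /var/lib/whatsapp/ in every waweb container of the
# deployment, so database.yml (and possibly the ssl certs) is shared.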

mengyiyuan commented 4 years ago

Hi Alessandro,

webapp should always support redundancy, and anything that makes this false is a bug. :)

The thing is that in the catch block we actually terminate the container, so the container should not stay up in an unhealthy state.

Just to be safe, can you confirm which version of the docker images you are using?

aogier commented 4 years ago

Hi Mengyi, I had indeed missed the exit(1) in the catch block, and obviously I then had a hard time reproducing the behaviour :grinning: I got it yesterday on a v2.25.5 image; anyway, here it is:

log.txt

Could it be a missing set -e in the launch_within_docker.sh script?
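
A minimal illustration of the shell semantics I mean (not code from the repo):

#!/bin/sh
# Without `set -e`, a failing command does not abort the script:
false                  # stand-in for the php script exiting non-zero
echo "still running"   # printed anyway, so the container stays up
# With `set -e` at the top, the script aborts at the first failing command,
# the entrypoint exits non-zero, and k8s restarts the pod.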

BTW, if I delete the unhealthy pod and recreate it, it correctly logs Database settings stored in database.yml file, so my guess is that this random error originates from some hard-to-observe race condition in the underlying libs... what do you think?

Thank you

mengyiyuan commented 4 years ago

Hi Alessandro,

Sorry for my late reply. It is a bit difficult to pinpoint the issue, especially considering how tricky it is to reproduce the behavior.

Let me check with the team about the set -e proposal for the script and get back to you.

aogier commented 4 years ago

Mengyi, hope you don't mind if I try to direct you toward a simple PoC.

As we said, every error in the initialization phase should cause the container to die (and thus the pod to restart in k8s). I pointed out a (rather strange) db initialization error that is rare to observe but nevertheless makes the php script exit non-zero. The problem is in the shell script that invokes this php script: without set -e it does not bail out.

So, how to reproduce? It's simply a matter of launching a single waweb container instance and letting it fail because there is no mysql server:

oggei@cane ~ docker run --rm -it -e WA_DB_ENGINE=MYSQL docker.whatsapp.biz/web:v2.25.5 | tee /tmp/log.txt

I'm attaching the logs; as you can see, db initialization indeed fails but the container stays up.

log.txt

Now, if I patch the file:

--- /tmp/launch_within_docker.sh        2020-02-03 11:10:36.099280234 +0100
+++ launch_within_docker.sh     2020-02-03 11:13:07.943305808 +0100
@@ -3,6 +3,8 @@
 # December 16, 2016
 #

+set -e
+
 HTTP_ROOT=/var/www/html
 HTTPD_CONF=/etc/lighttpd/lighttpd.conf
 MEDIA_ROOT=/usr/local/wamedia
 oggei@cane ~/dev/WhatsApp-Business-API-Setup-Scripts git:(feature/15-init-fix) ✗  docker run --rm -it -e WA_DB_ENGINE=MYSQL ciao              
Failed to create web db or store db settings to database.yml file.
db hostname, db username and db password are required
#0 /var/www/html/src/WhatsApp/Database/DBUtil.php(2): WhatsApp\Database\DBUtil::getDatabaseConfig(Array)
#1 /var/www/html/src/WhatsApp/Scripts/CreateWebDB.php(2): WhatsApp\Database\DBUtil::getDatabaseConfigFromEnv('waweb')
#2 {main}
 oggei@cane ~/dev/WhatsApp-Business-API-Setup-Scripts git:(feature/15-init-fix) ✗  

as you can see, the container correctly dies when the script fails. In my opinion this is the best option we have, given that it's a rare php condition we can safely ignore, letting the single instance restart.

What do you think? Hope this helps :)

Goodsmileduck commented 4 years ago

@aogier Do you think it will also help in cases when mysql is not reachable? We had an issue when the mysql pod was recreated: the web container got some errors in the logs but didn't restart. Once mysql became available again, the web container kept failing for about 90% of requests, until I deleted the web pods and new ones started.

@mengyiyuan maybe you have some suggestions for a healthcheck for the waweb container?

mengyiyuan commented 4 years ago

> Mengyi, hope you don't mind if I try to direct you toward a simple PoC. [...] What do you think? Hope this helps :)

Hi Alessandro,

Sorry for my late reply; I have been through some really crazy days because of the virus situation.

Thank you for following up on this issue. I reported it to the team the last time we talked, but it was not prioritized since we cannot consistently reproduce it.

This latest comment definitely makes a stronger case for solving it. Let me check with the team again and keep you posted.

Stay safe!

mengyiyuan commented 4 years ago

> @aogier Do you think it will also help in cases when mysql is not reachable? [...]
> @mengyiyuan maybe you have some suggestions for a healthcheck for the waweb container?

Hi Stanislav,

Sorry for my late reply. For the waweb health check, you can use the health endpoint: https://developers.facebook.com/docs/whatsapp/api/health. You can pass in a WA_API_KEY when starting the waweb containers, so that you can use the same key for all health checks without going through the usual login flow to obtain an auth token.
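
For illustration, something along these lines could work as a manual check or inside a k8s exec probe (the path, port, and Authorization scheme below are assumptions on my part; please verify them against the doc linked above):

# Hypothetical health check; adjust host, port, and auth scheme to your setup.
curl -fsk \
  -H "Authorization: Bearer ${WA_API_KEY}" \
  "https://localhost:443/v1/health"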

Does this help?

aogier commented 4 years ago

Mengyi! Glad to hear from you. I'm so sad about what's happening, and being based in Milano, Italy, I sadly understand what's going on...

Thank you for considering my report, and let me know if there is anything else I can do. I no longer work for my former employer, but I'm still developing a k8s operator that I'd be glad to keep on the bleeding edge :P

@Goodsmileduck: I don't think your problem fits here, as this ticket is about waweb's preliminary lifecycle steps. In your case I'd recommend:

  1. read the logs
  2. monitor the health endpoint @mengyiyuan mentioned
  3. avoid killing a pod because of another one's unhealthiness. This antipattern typically leads to amplification: a web application is by definition stateless on the web-db segment, so in no case would you want to restart a frontend because the backend fails, and in no case should a frontend restart be triggered by runtime errors caused by backend unavailability. Maybe the outage was bigger than observed and the deployment still had to reconverge within the usual intervals?

Thank you both, love and stay safe!