Sending security announcement emails prevents saving nodes

jenlampton commented 4 years ago

Today when we went to edit and then save the SA for today's security release, the browser hung for about 60 seconds, then the server threw a SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; error.

We tried again on the same node, and it happened again. We tried to publish the node using Bulk Operations on the content list, and that also took 60 seconds, then failed.

Other nodes on the site seem to save okay, but this one, for some reason, does not.

We then restarted MySQL on the server.

After that, We were able to create a duplicate node with the content we needed. It saved initially, but, unpublished. We edited it, set status to publish and tried to save. It would not save a second time.

The server now hung for 60 seconds, then threw a 504 Gateway Time-out error. The second time we tried to save, it, we got the same SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; error.

This new node would also not be published via Bulk Operations on the content list.

jenlampton commented 4 years ago

I found this on stack overflow:

Sometimes the process that has locked the table shows up as sleeping in the processlist! I was tearing my hair out until I killed all the sleeping threads that were open in the database in question (none were active at the time). That finally unlocked the table and let the update query run.

jenlampton commented 4 years ago

Nate killed a hanging process, but the next time we tried to save we got the 504 Gateway Time-out error again. The save after that one returned SQLSTATE[HY000]: General error: 2006 MySQL server has gone away again.

larsdesigns commented 4 years ago

I could not locate the database configuration that we had previously. I went ahead and implemented.

max_connections         = 100
connect_timeout           = 5
wait_timeout                = 600
max_allowed_packet          = 32M
thread_cache_size       = 128
sort_buffer_size        = 4M
bulk_insert_buffer_size = 16M
tmp_table_size            = 32M
max_heap_table_size       = 32M
query_cache_limit = 128K
query_cache_size = 64M
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 10
innodb_buffer_pool_size = 256M
innodb_log_buffer_size  = 8M
innodb_file_per_table = 1
innodb_open_files = 400
innodb_io_capacity = 400
innodb_flush_method = O_DIRECT
innodb_file_format = Barracuda
innodb_file_per_table = on
innodb_large_prefix = 1
innodb_file_format_max = Barracuda

@jenlampton, Let's see how these work out.

jenlampton commented 4 years ago

@quicksketch said a lot of those were not in place when he logged in, so he set most of them up. He says he thinks mysql has been using the default settings (perhaps ever since the server move?)

Did you find his changes from today? he added innodb_flush_method = O_DIRECT for example. And he had innodb_buffer_pool_size = 256M up at 2G

jenlampton commented 4 years ago

I just tried to save the node again, still getting 504 Gateway Time-out

quicksketch commented 4 years ago

Hey @larsdesigns, I set some updated MySQL variables in /etc/mysql/mariadb.conf.d/50-server.cnf, as it seemed like we were running the mostly default configuration. Things did seem to run a bit better once I set the following:

innodb_buffer_pool_size = 2000M
innodb_log_file_size = 250M
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT

I put mysql tuner (https://github.com/major/MySQLTuner-perl) in the root home folder to run an analysis of our general configuration.

Other things we tried included restarting the entire server. And backing-up, dropping, and reimporting the entire database for backdropcmsorg. The problem of not being able to update the status of that node persists.

To test, you can try going https://backdropcms.org/node/2797/edit, setting the status to published, then trying to save. On the server, you can then run SHOW FULL PROCESSLIST and you'll see that the database is stuck while trying to UPDATE the node table.

jenlampton commented 4 years ago

@quicksketch I have an idea. Why don't we turn off php, use a mysql query to update the node status then turn php back on?

jenlampton commented 4 years ago

@larsdesigns mysql won't start anymore, I'm getting the error mysql: unknown variable 'max_connections=100'

Since the config file /etc/mysql/mariadb.conf.d/50-server.cnf still shows that value at 80, I'll need to know where you put your config?

jenlampton commented 4 years ago

@larsdesigns I found your duplicate config file at /etc/mysql/conf.d/mysql.cnf and I was able to remove all the undefined variables from it. The only two variables that remain in there now are connect_timeout and max_allowed_packet. All the others prevented from mysql from running on the command line.

After cleaning up that file I was able to disable PHP (thanks to the hint by @quicsketch), use mysql on the command line to manually alter the node and node_revision table so status is now 1, and restart PHP.

I got the node published.

quicksketch commented 4 years ago

I think I found the source of the problem, and why it affected Security Announcements but not other types of content.

Our custom module borg_mailers_security has this code in hook_node_update():

function borg_mailers_security_node_update($node) {
  _borg_mailers_security_worker($node);
}

That function mails all subscribers when a security announcement is published. It seems as though separately, the SMTP account BackdropCMS.org was using from MailJet was apparently terminated, so BackdropCMS.org was unable to send email. When the site tried to send the security email, it couldn't. Because the emails were sent in the middle of the node saving process, the node wouldn't ever completely save because the page would time out.

For the time being, I've disabled SMTP module (again) since our site can send email fine without it now that we have re-setup DKIM, reverse-proxy, SPF, DMARC, etc.

I'm going to work up some small code changes to our custom module to prevent it from completing saving the node. Ideally we'd actually switch this to either a queue process, or stop having BackdropCMS.org send the email directly and instead use a 3rd party mailing list.

larsdesigns commented 4 years ago

Can we go ahead and close this one?

jenlampton commented 4 years ago

As long as we have another issue for I'm going to work up some small code changes to our custom module open already. If we close this without that, we'll never remember to make those changes.

quicksketch commented 4 years ago

I filed PR at https://github.com/backdrop-ops/backdropcms.org/pull/672 that should fix this problem by switching to sending emails during cron jobs using a queue.

quicksketch commented 4 years ago

@serundeputy Reviewed the PR and gave it the thumbs up. Thanks Geoff!

backdrop-ops / backdropcms.org

Sending security announcement emails prevents saving nodes #670