Percona-Lab / mongodb_consistent_backup

A tool for performing consistent backups of MongoDB Clusters or Replica Sets
https://www.percona.com
Apache License 2.0

Urgent: Oplog Resolver Intermittently doesn't Complete Resolving #277

Open corey-hammerton opened 5 years ago

corey-hammerton commented 5 years ago

On our backup server we experience cases where backups fail to complete successfully because the Oplog Resolver from a previous backup didn't complete. This happens on many of our backup configurations using default oplog configuration settings.

The ResolverThreads appear to complete successfully, as shown by the logs below (hostnames redacted). The resolver, however, never frees the threads and never continues with the backup. Manually killing the defunct processes releases the locks and the MainProcess thread then logs that oplog resolving completed; specifically, killing the processes with high Virtual Memory Size on the server achieves this.

[2018-09-15 00:23:33,726] [INFO] [PoolWorker-11] [ResolverThread:run:36] Resolving oplog for XXXXXXXXXXXXXXXXXXXXX:10000 to max ts: Timestamp(1536970998, 0)
[2018-09-15 00:23:33,796] [INFO] [PoolWorker-10] [ResolverThread:run:36] Resolving oplog for XXXXXXXXXXXXXXXXXXXXX:10001 to max ts: Timestamp(1536970998, 0)
[2018-09-15 00:23:33,828] [INFO] [PoolWorker-11] [ResolverThread:run:60] Applied 56 oplog changes to XXXXXXXXXXXXXXXXXXXXX:10000 oplog, end ts: Timestamp(1536970999, 1)
[2018-09-15 00:23:34,029] [INFO] [PoolWorker-10] [ResolverThread:run:60] Applied 205 oplog changes to XXXXXXXXXXXXXXXXXXXXX:10001 oplog, end ts: Timestamp(1536970997, 50)
[2018-09-17 02:17:23,739] [INFO] [MainProcess] [Resolver:run:142] Oplog resolving completed in 0.00 seconds

In our testing and debugging we have narrowed it down to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105. Please advise.
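For context, the PoolWorker-N names in the log indicate the resolver dispatches per-shard work to a multiprocessing pool, and the symptom matches the parent blocking in a pool wait even though every worker already finished. The following is only an illustrative sketch of that shape (function and host names are made up, this is not the actual Resolver code):

```python
# Illustrative sketch only -- not the actual Resolver code.
# Shows the multiprocessing.Pool pattern where the parent can stay
# blocked in join() even after all workers have logged completion,
# leaving the finished children as <defunct>.
from multiprocessing import Pool

def resolve_oplog(host_port, max_ts):
    # Placeholder for the per-shard oplog resolving work.
    return host_port, max_ts

if __name__ == '__main__':
    pool = Pool(processes=2)
    results = []
    for host_port in ['shard1:10000', 'shard2:10001']:
        results.append(pool.apply_async(resolve_oplog, (host_port, None)))
    pool.close()
    # If an internal pool queue or worker wedges, this join() can block
    # indefinitely while the children show up as defunct processes.
    pool.join()
    for r in results:
        print(r.get())
```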

Generic server information:

OS: CentOS 7.5.1804
YUM Package Version: 1.3.0
Python Version: 2.7.5
CPU Count: 8
RAM Size: 15G
corey-hammerton commented 5 years ago

@timvaillancourt ^^

corey-hammerton commented 5 years ago

Using the pystack debugger with gdb, I managed to get a thread dump of some of our hanging processes. The dump of one of the hung threads is attached to this comment: mongodb_consistent_backup_threaddump.txt

The short version is the thread dump points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/1.3.0/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L45

corey-hammerton commented 5 years ago

Got a dump of the parent PID of the defunct processes; the results are attached below: mongodb_consistent_backup_threaddump.txt

This one points to https://github.com/Percona-Lab/mongodb_consistent_backup/blob/master/mongodb_consistent_backup/Oplog/Resolver/Resolver.py#L105

timvaillancourt commented 5 years ago

Hi @corey-hammerton, I suspect this is related to the same stalls seen in #165. I've also traced it down with thread dumps, strace, etc., and cannot explain why 'multiprocessing' and/or threading is stalling.

There are a few issues on GitHub for multiprocessing that sound related to this problem, but I haven't seen a clear solution yet (at least not one that doesn't mean large changes like moving to Python 3, etc.). I'm hoping someone with more familiarity with Python internals can help take this investigation further.
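For anyone hitting this in the meantime, one defensive pattern is to bound the wait on each worker so a wedged pool fails the run instead of hanging it for days. This is purely a sketch under the same assumed apply_async pattern as above, not something mongodb_consistent_backup currently implements; the names and the 3600-second limit are illustrative:

```python
# Sketch only: bounding the wait so a stuck worker raises instead of
# blocking the parent forever. Names and the timeout value are
# illustrative, not taken from mongodb_consistent_backup.
from multiprocessing import Pool, TimeoutError

def resolve_oplog(host_port):
    return host_port  # stand-in for the real per-shard work

if __name__ == '__main__':
    pool = Pool(processes=2)
    results = [pool.apply_async(resolve_oplog, (h,))
               for h in ['shard1:10000', 'shard2:10001']]
    pool.close()
    try:
        for r in results:
            r.get(timeout=3600)  # raise TimeoutError rather than wait forever
    except TimeoutError:
        pool.terminate()         # kill remaining workers so the parent can exit
        raise
    finally:
        pool.join()
```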