Closed smukil closed 6 years ago
I am sorry! :(
I read the comment in the code and it has indeed complicated cases. How did you debug this? Maybe a log debug around such cases help where we silently drop replication messages?
I am sorry! :(
I read the comment in the code and it has indeed complicated cases. How did you debug this? Maybe a log debug around such cases help where we silently drop replication messages?
No problem at all, typos are bound to happen sometime or the other. I think the best way to catch any form of human error is to add more automated testing. I will add a test case in an upcoming patch to test multi-buffer payloads that are communicated across DCs. As we improve our automated testing suite, bugs like this can hopefully be caught before merging them itself.
To debug this, I added quite a few changes to our test framework to try and repro this in a controlled setting, and then I used GDB to figure out what was going on.
In many cases, queries or responses that spanned multiple mbufs would not get decrypted correctly causing cross-DC traffic to be dropped silently which resulted in a failure to replicate data in different DCs.
After a lot of debugging, this was found to be due to a typo introduced in a previous commit: https://github.com/Netflix/dynomite/commit/91e8fe38ea66b79b5241f02169fa04e2ad9a50ea
This typo is now fixed and multi-buffer decryption works seamlessly.