Snapchat / KeyDB

A Multithreaded Fork of Redis
https://keydb.dev
BSD 3-Clause "New" or "Revised" License

[BUG] Data Loss during RDB load to Flash #755

Closed: keithchew closed this issue 3 months ago

keithchew commented 7 months ago

This is just a preliminary bug report, as I am still trying to track down the root cause. Steps are simple:

(1) Start with a fresh KeyDB instance
(2) Load 6 million keys from an RDB on startup
(3) Sanity-check the key count for known namespaces; the test passes if the counts match

Repeating the steps above several times, we sometimes end up with missing keys.
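
For anyone wanting to reproduce the check in step (3), the sketch below shows the general shape of it. This is not my exact harness: hiredis, the 127.0.0.1:6379 address and the ns1:/ns2: prefixes are stand-ins, and a SCAN-based count can include duplicates, so treat it as a sanity check rather than an exact count.

// Minimal sketch of the sanity count in step (3); placeholders as noted above.
#include <hiredis/hiredis.h>
#include <cstdio>
#include <cstring>

static long long countKeys(redisContext *c, const char *pattern) {
    long long count = 0;
    char cursor[64] = "0";
    do {
        redisReply *r = (redisReply *)redisCommand(
            c, "SCAN %s MATCH %s COUNT 1000", cursor, pattern);
        if (r == NULL || r->type != REDIS_REPLY_ARRAY || r->elements != 2) {
            fprintf(stderr, "SCAN failed for %s\n", pattern);
            if (r) freeReplyObject(r);
            return -1;
        }
        snprintf(cursor, sizeof(cursor), "%s", r->element[0]->str);
        count += (long long)r->element[1]->elements;   // keys returned in this batch
        freeReplyObject(r);
    } while (strcmp(cursor, "0") != 0);                // cursor "0" means the scan is done
    return count;
}

int main() {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }
    // Each namespace is expected to hold 10k keys after the RDB load.
    printf("ns1 count: %lld\n", countKeys(c, "ns1:*"));
    printf("ns2 count: %lld\n", countKeys(c, "ns2:*"));
    redisFree(c);
    return 0;
}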

So far, I have tracked the issue down at a high level. The logs below track two namespaces which should each have a count of 10k: the rdbLoadRio lines are logged when loading from the file, and the bulk insert lines when writing to flash. We can see that the key counts are lower when writing to flash:

************* rdbLoadRio count1 [1000] count2 [963]
************* rdbLoadRio count1 [1023] count2 [1000]
************* bulk insert fallback count1 [1000] count2 [963]
************* bulk insert fallback count1 [1025] count2 [1000]
************* rdbLoadRio count1 [2000] count2 [1975]
************* rdbLoadRio count1 [2024] count2 [2000]
************* rdbLoadRio count1 [3000] count2 [2988]
************* rdbLoadRio count1 [3008] count2 [3000]
************* bulk insert fallback count1 [2000] count2 [1977]
************* bulk insert fallback count1 [2023] count2 [2000]
************* rdbLoadRio count1 [4000] count2 [3964]
************* rdbLoadRio count1 [4041] count2 [4000]
************* rdbLoadRio count1 [4999] count2 [5000]
************* rdbLoadRio count1 [5000] count2 [5000]
************* bulk insert fallback count1 [3000] count2 [2988]
************* bulk insert fallback count1 [3008] count2 [3000]
************* rdbLoadRio count1 [6000] count2 [5963]
************* rdbLoadRio count1 [6036] count2 [6000]
************* bulk insert fallback count1 [4000] count2 [3964]
************* bulk insert fallback count1 [4041] count2 [4000]
************* rdbLoadRio count1 [7000] count2 [6990]
************* rdbLoadRio count1 [7007] count2 [7000]
************* rdbLoadRio count1 [8000] count2 [7975]
************* rdbLoadRio count1 [8023] count2 [8000]
************* bulk insert fallback count1 [5000] count2 [4999]
************* bulk insert fallback count1 [5000] count2 [5000]
************* rdbLoadRio count1 [9000] count2 [8971]
************* rdbLoadRio count1 [9040] count2 [9000]
************* rdbLoadRio count1 [9996] count2 [9990]
************* rdbLoadRio count1 [9997] count2 [9991]
************* rdbLoadRio count1 [9998] count2 [9991]
************* rdbLoadRio count1 [9999] count2 [9995]
************* rdbLoadRio count1 [9999] count2 [9996]
************* rdbLoadRio count1 [10000] count2 [9996]
************* rdbLoadRio count1 [10000] count2 [9997]
************* rdbLoadRio count1 [10000] count2 [9998]
************* rdbLoadRio count1 [10000] count2 [9999]
************* rdbLoadRio count1 [10000] count2 [10000]
7658:7658:M 03 Dec 2023 04:16:26.616 * DB loaded from disk: 24.729 seconds
7658:7696:M 03 Dec 2023 04:16:26.616 * Thread 0 alive.
7658:7696:M 03 Dec 2023 04:16:26.616 # DB 0: 3676161 keys (823078 volatile) in 1536 slots HT.
7658:7696:M 03 Dec 2023 04:16:26.616 # Mem level [0.129767]
************* bulk insert fallback count1 [6000] count2 [5964]
************* bulk insert fallback count1 [6036] count2 [6000]
************* bulk insert fallback count1 [7000] count2 [6991]
************* bulk insert fallback count1 [7007] count2 [7000]
7658:7696:M 03 Dec 2023 04:16:31.632 # DB 0: 4487169 keys (823078 volatile) in 1024 slots HT.
7658:7696:M 03 Dec 2023 04:16:31.632 # Mem level [0.093809]
************* bulk insert fallback count1 [8000] count2 [7975]
************* bulk insert fallback count1 [8022] count2 [8000]
7658:7696:M 03 Dec 2023 04:16:36.649 # DB 0: 5261313 keys (823078 volatile) in 1024 slots HT.
7658:7696:M 03 Dec 2023 04:16:36.649 # Mem level [0.054531]
************* bulk insert fallback count1 [9000] count2 [8971]
************* bulk insert fallback count1 [9038] count2 [9000]
7658:7696:M 03 Dec 2023 04:16:41.665 # DB 0: 5988353 keys (823078 volatile) in 1024 slots HT.
7658:7696:M 03 Dec 2023 04:16:41.665 # Mem level [0.020768]
************* bulk insert fallback count1 [9996] count2 [9990]
************* bulk insert fallback count1 [9997] count2 [9990]
************* bulk insert fallback count1 [9998] count2 [9991]
************* bulk insert fallback count1 [9999] count2 [9994]
************* bulk insert fallback count1 [9999] count2 [9996]
************* bulk insert fallback count1 [9999] count2 [9997]
************* bulk insert fallback count1 [10000] count2 [9997]
************* bulk insert fallback count1 [10000] count2 [9998]
************* bulk insert fallback count1 [10000] count2 [9999]

Will post an update here when I track this down further.

keithchew commented 7 months ago

I have tracked it down! This logic in rdb.cpp does both loading and eviction. Performing eviction on high memory or on every 1024th key is all good, but the 1024th-key processing can cause missing keys.

        if (g_pserver->m_pstorageFactory && f1024thKey)
        {
            bool fHighMemory = (getMaxmemoryState(NULL,NULL,NULL,NULL) != C_OK);
            if (fHighMemory || f1024thKey)
            {
                for (int idb = 0; idb < cserver.dbnum; ++idb)
                {
                    if (g_pserver->m_pstorageFactory) {
                        g_pserver->db[idb]->processChangesAsync(this->cstorageWritesInFlight);
                        fHighMemory = false;
                    }
                }
                if (fHighMemory)
                    performEvictions(false /* fPreSnapshot*/);
            }
            g_pserver->garbageCollector.endEpoch(serverTL->gcEpoch);
            serverTL->gcEpoch = g_pserver->garbageCollector.startEpoch();
        }

I did a quick check and crudely moved the loading snippet out, i.e.:

        if (g_pserver->m_pstorageFactory && f1024thKey)
        {
            bool fHighMemory = (getMaxmemoryState(NULL,NULL,NULL,NULL) != C_OK);
            if (fHighMemory || f1024thKey)
            {
                // for (int idb = 0; idb < cserver.dbnum; ++idb)
                // {
                //     if (g_pserver->m_pstorageFactory) {
                //         g_pserver->db[idb]->processChangesAsync(this->cstorageWritesInFlight);
                //         fHighMemory = false;
                //     }
                // }
                if (fHighMemory)
                    performEvictions(false /* fPreSnapshot*/);
            }
            g_pserver->garbageCollector.endEpoch(serverTL->gcEpoch);
            serverTL->gcEpoch = g_pserver->garbageCollector.startEpoch();
        }
        if (g_pserver->m_pstorageFactory) {
            for (int idb = 0; idb < cserver.dbnum; ++idb)
            {
                g_pserver->db[idb]->processChangesAsync(this->cstorageWritesInFlight);
            }
        }

No more missing keys after this. This also explains why I have been getting strange random errors at the application level during my evaluation of KeyDB: keys were not being loaded when I tried to restore an RDB to a fresh KeyDB instance.

Hope this helps anyone encountering the same issue.

splitice commented 6 months ago

This would be a cleaner implementation:

if (g_pserver->m_pstorageFactory) {
  if (f1024thKey)
  {
      bool fHighMemory = (getMaxmemoryState(NULL,NULL,NULL,NULL) != C_OK);
      if (fHighMemory)
      {
          performEvictions(false /* fPreSnapshot*/);
      }
      g_pserver->garbageCollector.endEpoch(serverTL->gcEpoch);
      serverTL->gcEpoch = g_pserver->garbageCollector.startEpoch();
  }
  for (int idb = 0; idb < cserver.dbnum; ++idb)
  {
      g_pserver->db[idb]->processChangesAsync(this->cstorageWritesInFlight);
  }
}

splitice commented 6 months ago

To summarise: processChangesAsync needs to be called regardless of fHighMemory, and for every key?

keithchew commented 3 months ago

I can confirm this bug is fixed in the async_flash branch. In that branch, processChangesAsync is still called on every 1024th key, but endWork() ensures all writes are done. That is a better solution, as writes are done in batches instead of per record.
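
To illustrate the idea, here is a rough sketch of the batching-plus-barrier pattern only, not the actual async_flash code; apart from the names processChangesAsync and endWork mentioned above, everything in it is made up for illustration. Pending changes are flushed asynchronously in batches, and a final barrier makes the load wait until every in-flight batch has hit storage before it is reported complete.

// Sketch of the batching + barrier idea; not KeyDB's implementation.
// flushBatchAsync() stands in for the per-1024-keys flush, and endWork()
// is the final "all writes are done" barrier.
#include <functional>
#include <thread>
#include <utility>
#include <vector>

class AsyncFlusher {
    std::vector<std::thread> workers;   // one in-flight write per queued batch
public:
    // Kick off an asynchronous write of one batch of pending changes.
    void flushBatchAsync(std::function<void()> writeBatch) {
        workers.emplace_back(std::move(writeBatch));
    }
    // Barrier: block until every queued batch has finished writing.
    void endWork() {
        for (auto &t : workers) t.join();
        workers.clear();
    }
    ~AsyncFlusher() { endWork(); }
};

// Usage shape during a load (takePendingChanges/writeToFlash are hypothetical):
//   AsyncFlusher flusher;
//   for each key in the RDB {
//       insert(key);
//       if (++loaded % 1024 == 0)
//           flusher.flushBatchAsync([batch = takePendingChanges()] { writeToFlash(batch); });
//   }
//   flusher.flushBatchAsync([batch = takePendingChanges()] { writeToFlash(batch); });
//   flusher.endWork();   // nothing is reported "loaded from disk" before this returns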

JohnSully commented 3 months ago

Thanks @keithchew for confirming it's fixed there. Sorry we worked too independently on this. I have one remaining showstopper with FLASH + FASTSYNC that we are hitting, which has stopped me from merging the async_flash branch, but it will be rolled into the main release when it's done.

We are validating this initially with a 120-node cluster here at Snap, but we will be ramping up the use significantly over the next 3 months, so it's likely quite a few more of these will get shaken out.

Will close this now as the issue is resolved.

JohnSully commented 3 months ago

For those who are diving into the code on this thread, would it be helpful to have a community meeting to share notes?

keithchew commented 3 months ago

Hi @JohnSully, I am not testing FASTSYNC yet (that feature looks great, and I will likely switch to it once RDB is stable), but maybe your showstopper could be due to this?

https://github.com/Snapchat/KeyDB/issues/808

I found that during replication, the async prefetch is still being called and this is messing up the DB count. My local workaround is posted in the link, but there is probably a better way.