DrTimothyAldenDavis / GraphBLAS

SuiteSparse:GraphBLAS: graph algorithms in the language of linear algebra. For production: (default) STABLE branch. Code development: ask me for the right branch before submitting a PR. video intro: https://youtu.be/Tj5y6d7FegI .
http://faculty.cse.tamu.edu/davis/GraphBLAS.html

Fastest Way to Exfiltrate a Matrix #226

Closed victorstewart closed 1 year ago

victorstewart commented 1 year ago

The application client front end to my graph database is single threaded, so whenever the database needs to fully copy each of its matrices and vectors and then transfer that data over the network to create a new master or a backup, application client traffic has to be blocked until the copy completes.

After this little test on a 2016 Intel NUC Linux box on my LAN, it seems like the quickest approach will be to serialize each object (then I can easily transfer the data over the network on another thread).

Even though I'll be running on much faster server CPUs, I'm still worried about how these serialization times might grow as the data grows and across multiple objects. I'm considering adding one read-only replica in each datacenter (on top of the three active-active-active masters) solely to service this function.

I wish there were a "copy on write" memory allocation ability in userspace for anonymous memory!

Someday I might have to consider the horrors of sharding (haha) the data and operations and then stitching the results back together.

root@clr-b5df9984821e4d129387e172044f5754~ # ./graphreplication.test
    5000000x5000000 matrix with 100000000 elements consumes 2048.0MB
    milliseconds to allocate + copy the matrix: 3433ms
    milliseconds to allocate + serialize the matrix: 781ms
    size of serialized matrix: 275.6MB
#include <cstdint>
#include <cstdlib>
#include <random>
#include <chrono>
#include <iostream>
#include <iomanip>

extern "C" {
#include <GraphBLAS/GraphBLAS.h>
}

template <typename Lambda>
static void timedTest(const char *words, Lambda&& lambda)
{
   // steady_clock is monotonic, which makes it the right clock for
   // measuring intervals (system_clock can jump if the wall clock is adjusted)
   auto start = std::chrono::steady_clock::now();

   lambda();

   auto end = std::chrono::steady_clock::now();
   auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

   std::cout << words << milliseconds << "ms" << std::endl;
}

static float bytesToMB(size_t bytes)
{
   return static_cast<float>(bytes) / (1024 * 1024);
}

int main()
{
   GrB_init(GrB_NONBLOCKING);

   constexpr GrB_Index matrixDimension = 5'000'000;

   GrB_Matrix A;
   GrB_Matrix_new(&A, GrB_UINT8, matrixDimension, matrixDimension);

   // uniform random coordinates spanning the full matrix dimension
   std::random_device rd;
   std::mt19937_64 gen(rd());
   std::uniform_int_distribution<GrB_Index> dis(0, matrixDimension - 1);

   constexpr uint32_t nRandomElements = 100'000'000;

   for (uint32_t index = 0; index < nRandomElements; index++)
   {
      GrB_Index randomRow = dis(gen);
      GrB_Index randomColumn = dis(gen);

      GrB_Matrix_setElement_UINT8(A, uint8_t(1), randomRow, randomColumn);
   }

   size_t size;
   GxB_Matrix_memoryUsage(&size, A);   // in-memory footprint of A, in bytes

   std::cout << matrixDimension << "x" << matrixDimension << " matrix with " << nRandomElements << " elements consumes " << std::fixed << std::setprecision(1) << bytesToMB(size) << "MB" << std::endl;

   GrB_Matrix B;

   timedTest("milliseconds to allocate + copy the matrix: ", [&] (void) -> void {

      GrB_Matrix_dup(&B, A); 
   });

   void *blob_handle;
   GrB_Index blob_size_handle;

   timedTest("milliseconds to allocate + serialize the matrix: ", [&] (void) -> void {

      GrB_Descriptor desc;
      GrB_Descriptor_new(&desc);
      GxB_Desc_set(desc, GxB_COMPRESSION, GxB_COMPRESSION_ZSTD);

      GxB_Matrix_serialize(&blob_handle, &blob_size_handle, A, desc);
   });

   std::cout << "size of serialized matrix: " << std::fixed << std::setprecision(1) << bytesToMB(blob_size_handle) << "MB" << std::endl;

   return 0;
}
DrTimothyAldenDavis commented 1 year ago

The timings are misleading in this example. You've built the matrix with setElement, which is the slower way to do it, and it also leaves the matrix as a pile of unsorted tuples. If you want to time just the dup, you would need to put a GrB_wait outside the timer, first. Otherwise, GrB_Matrix_dup must first call GrB_wait itself on the matrix, which is doing the work of GrB_Matrix_build on the unsorted tuples from setElement.
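For example (this assumes a recent SuiteSparse:GraphBLAS where GrB_Matrix_wait takes the GrB_MATERIALIZE mode from the v2.0 C API; older releases used GrB_Matrix_wait(&A)):

   // finish A's pending work (sorting and assembling the setElement tuples)
   // outside the timed region, so the timer sees only the copy
   GrB_Matrix_wait(A, GrB_MATERIALIZE);

   timedTest("milliseconds to allocate + copy the matrix: ", [&] (void) -> void {
      GrB_Matrix_dup(&B, A);
   });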

Serialization is likely the best way to send a matrix across a network. That's what it's designed for. An alternative would be to use no compression at all, which leads to a faster serialize time, but requires more bytes to send. Another alternative would be to use GxB_Matrix_unpack to unpack the matrix in O(1) time, transmit the pieces, and then GxB_Matrix_pack it again.
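A rough sketch of the unpack/pack route, assuming the matrix happens to be held in CSR form (each storage format has its own unpack/pack pair):

   GrB_Index *Ap, *Aj;                    // row pointers, column indices
   void *Ax;                              // values
   GrB_Index Ap_size, Aj_size, Ax_size;   // sizes in bytes
   bool iso, jumbled;

   // O(1): A surrenders ownership of its internal arrays (A is left empty but valid)
   GxB_Matrix_unpack_CSR(A, &Ap, &Aj, &Ax, &Ap_size, &Aj_size, &Ax_size,
                         &iso, &jumbled, NULL);

   // ... transmit Ap, Aj, and Ax over the network ...

   // O(1): hand the arrays back and A is whole again
   GxB_Matrix_pack_CSR(A, &Ap, &Aj, &Ax, Ap_size, Aj_size, Ax_size,
                       iso, jumbled, NULL);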

victorstewart commented 1 year ago

Magic!!! I knew about needing to wait for pending operations to complete, of course, but I didn't realize that applied to setElement as well.

This order-of-magnitude drop in time cost buys me a long runway before I have to worry about this problem again. Thanks so much.

I see now that serialization without compression is essentially an allocate + memcpy as well, so I'll just copy the matrix on the "main thread" and then serialize it with GxB_COMPRESSION_ZSTD on a worker thread, something like the sketch below.
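Roughly this (sendBlob is a stand-in for whatever actually ships the bytes over the network; requires <thread>):

   // main thread: cheap snapshot (A has already been materialized by GrB_wait)
   GrB_Matrix B;
   GrB_Matrix_dup(&B, A);

   // worker thread: compress + serialize the snapshot, ship it, release it
   std::thread([B] (void) mutable {
      GrB_Descriptor desc;
      GrB_Descriptor_new(&desc);
      GxB_Desc_set(desc, GxB_COMPRESSION, GxB_COMPRESSION_ZSTD);

      void *blob;
      GrB_Index blob_size;
      GxB_Matrix_serialize(&blob, &blob_size, B, desc);

      sendBlob(blob, blob_size);   // hypothetical network send

      free(blob);
      GrB_Descriptor_free(&desc);
      GrB_Matrix_free(&B);
   }).detach();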

The memory consumption of the matrix also fell to 37.6% of the original after GrB_wait with GrB_MATERIALIZE, so only ~8 bytes per uint8_t element instead of the ~22 bytes I assumed before. That puts the size differential between the GrB_Matrix and its serialized form at roughly 2.8x (770.5MB vs 275.6MB) instead of the 7.4x it looked like before, about 2.7x smaller, which makes it much easier to justify overcommitting the database's memory for the transient copy-then-compress step.

I need to preserve the matrices, so the unpack route isn't an option for me, but its overall time cost would be on par with these other approaches anyway.

root@clr-b5df9984821e4d129387e172044f5754~ # ./graphreplication.test
    milliseconds to GrB_MATERIALIZE: 3354ms
    1000000x1000000 matrix with 100000000 elements consumes 770.5MB
    milliseconds to allocate + copy the matrix: 76ms
    milliseconds to allocate + serialize the matrix with compression: 776ms
    size of serialized compressed matrix: 275.6MB
    milliseconds to allocate + serialize the matrix NO compression: 72ms
    size of serialized matrix without compression: 770.5MB
victorstewart commented 1 year ago

Circling back on this: I realized the optimal solution is to fork the process to get copy-on-write matrices and then serialize in the child process. This requires no downtime and the absolute minimum of extra memory.
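Roughly like this (sendBlob is again a stand-in for the network send; one caveat is that forking a process with live OpenMP threads is only safe if the child avoids depending on the parent's thread state):

   #include <unistd.h>   // fork, _exit

   pid_t pid = fork();
   if (pid == 0)
   {
      // child: every page is shared copy-on-write with the parent, so the
      // snapshot costs no upfront copy and only dirties pages as it runs
      GrB_Descriptor desc;
      GrB_Descriptor_new(&desc);
      GxB_Desc_set(desc, GxB_COMPRESSION, GxB_COMPRESSION_ZSTD);

      void *blob;
      GrB_Index blob_size;
      GxB_Matrix_serialize(&blob, &blob_size, A, desc);

      sendBlob(blob, blob_size);   // hypothetical network send

      _exit(0);                    // bypass atexit/static destructors in the child
   }
   // parent: fork returns immediately and client traffic is never blocked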