ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Fast memcpy, or similar of flags/u/phi buffers #150

Closed. h3mosphere closed this issue 4 months ago.

h3mosphere commented 4 months ago

Hello,

First up, thank you for this fantastic software! It is quite amazing what it is capable of.

On to my problem: I am looking to quickly copy the memory buffers for the various simulation values into host memory, for subsequent processing in a different thread. Currently I have the following to export the TYPE_F cells as particles (for further processing):

vector<float3> LBM::get_particles() {
    this->flags.read_from_device(); // copy the flags buffer from VRAM to RAM
    ThreadSafeVector<float3> particles; // defined elsewhere
    particles.reserve(this->get_N() * 3 / 4);
    parallel_for(this->get_N(), [&](uint i) {
        const uchar flagsi = this->flags[i]; // renamed to avoid shadowing the flags member
        if (flagsi & TYPE_F) { // only export fluid cells
            uint x = 0u, y = 0u, z = 0u; // initialize all three coordinates
            this->coordinates((ulong) i, x, y, z);
            const float3 xyz((float) x, (float) y, (float) z);
            particles.push_back(xyz);
        }
    });

    return particles.inner();
}

This, however, adds a noticeable pause to the simulation.

The basic problem is to get the relevant data out of LBM as quickly as possible, so that it can continue with its simulation.

I was first wondering if it is possible to copy the internal memory buffer (or get a reference to it). However, it also occurred to me that it may make more sense to either:

a) have a function that transfers the data directly from the domain devices into a memory location that is not contained within the LBM/Memory_Container objects, or
b) have the ability to 'detach' the memory buffers/Memory_Containers and return them directly (one per domain), for further processing and freeing in another thread.

There is a somewhat tricky problem here, however: looking at the Memory_Container [] index operator and reference() functions, these have to take multiple LBM_Domains into account and interleave them appropriately. This could be mitigated by stacking the domains in only one direction (Z?), so that the memory is naturally ordered linearly.
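
For illustration only, option a) might look like the following hypothetical signature (this is not existing FluidX3D API, just a sketch of the idea):

    // Hypothetical, not part of FluidX3D: copy the flags of all domains, de-interleaved
    // into the global x/y/z cell order, straight into a caller-owned buffer of get_N() bytes,
    // so the simulation can resume as soon as the device-to-host transfer has finished.
    void LBM::read_flags_to(uchar* const host_destination);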

I hope this makes sense; your thoughts are much appreciated.

ProjectPhysX commented 4 months ago

Hi @h3mosphere,

you can do this quite easily: once you have copied the data from VRAM to RAM with read_from_device(), you can spawn a new detached thread on the CPU side and immediately continue the simulation on the GPU. This detached thread can then take its time to do the memory copy and any other processing on the data. You only have to wait for it to finish the memory copy (use std::atomic_int variables here) before the next read_from_device() call happens and overwrites the original CPU data.
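
A minimal sketch of this pattern, assuming an lbm object with a flags buffer as in the snippet above (the function name export_flags_async, the copy_pending variable and the element-wise buffer access are illustrative assumptions, not FluidX3D API):

    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic_int copy_pending(0); // 1 while the detached thread still reads the LBM host buffer

    void export_flags_async(LBM& lbm) { // lbm must outlive the detached thread
        while(copy_pending.load() != 0) std::this_thread::yield(); // wait until the previous export has released the host buffer
        lbm.flags.read_from_device(); // short pause: VRAM -> RAM copy
        copy_pending.store(1);
        std::thread([&lbm]() {
            std::vector<uchar> flags_copy((size_t)lbm.get_N()); // private copy, so the next read_from_device() may overwrite the original
            for(ulong i = 0ull; i < lbm.get_N(); i++) flags_copy[i] = lbm.flags[i];
            copy_pending.store(0); // from here on, the simulation may overwrite the LBM host buffer again
            // ... slow processing on flags_copy (particle extraction, file export, ...) ...
        }).detach();
        // the GPU simulation continues immediately after this function returns
    }

The key point is that only the read_from_device() call and the one RAM-to-RAM copy block the simulation loop; all slow processing happens in the detached thread.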

This is very similar to how I'm already doing the .png image export: here the read_from_device() plus one CPU copy happen sequentially (the pause is really short, as the data is only a few MB), and then I spawn a detached thread for .png compression, which takes much longer. If you export a lot of 4K images in rapid succession, you'll notice the CPU load going to 100% when all cores are busy doing .png compression, each on one image.

Kind regards, Moritz