conda-forge / pytorch-cpu-feedstock

A conda-smithy repository for pytorch-cpu.
BSD 3-Clause "New" or "Revised" License

(Future release) building with GDS #257

Open jakirkham opened 2 months ago

jakirkham commented 2 months ago

Assuming a user's workflow can run entirely on the GPU, the remaining piece that can be slow is file IO. Of course file IO can be slow for its own reasons. However, the issue of interest here is that data is first read into host memory and then transferred to the device, which is slow (especially for a big chunk of data).

To address this, NVIDIA rolled out GPUDirect Storage (publicized in this blog post and covered in these docs). Basically, the idea is to go directly from file IO to GPU memory (or back), bypassing host memory (and thus the expensive transfer cost associated with it). There is a bit of setup required to get this working, but it can be valuable for larger data workflows.

To see this in action, I would recommend reading this blog post about using Xarray and KvikIO (a RAPIDS library leveraging GPUDirect Storage) to load Zarr-based climate data into Xarray (with CuPy on the backend).

Recently, PyTorch started adding support for GPUDirect Storage with PR ( https://github.com/pytorch/pytorch/pull/133489 ). It is already merged, but not yet in any release.

Once it is in a release, we could enable this here by adding libcufile-dev to requirements/host and setting the CMake option USE_CUFILE to 1.
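For reference, a minimal sketch of what that recipe change might look like (the file layout, selector, and variable names here are assumptions; the actual feedstock recipe may differ):

```yaml
# recipe/meta.yaml (sketch): make the cuFile headers/libraries available
# during the build, guarded so it only applies to CUDA variants
requirements:
  host:
    - libcufile-dev  # [cuda_compiler_version != "None"]
```

Then, in the build script for CUDA variants, something like `export USE_CUFILE=1` before invoking the PyTorch build would turn the feature on.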

So for now, simply raising awareness of this upcoming feature.

ngam commented 2 months ago

@jakirkham slightly unrelated: what would it take to bring KvikIO to conda-forge? More generally, I remember there was a plan to move packages from the rapidsai channel to conda-forge back in the day (or maybe I am misremembering?). Searching for "KvikIO" on anaconda.org: https://anaconda.org/search?q=KvikIO

mgorny commented 7 hours ago

FWICS it's in 2.5.1 already. I guess I'll try tackling that next.