Open corbett5 opened 3 years ago
I think in principle what you're asking for would be possible via CUDA streams (I must confess to not knowing much about them), but I'm unsure how we would expose such functionality through the zfp API. Currently the only entry point we provide is zfp_compress(), which does a fair amount of setup work on the CPU and handles any data motion between CPU and GPU. The actual CUDA compression kernel is launched some six levels deep.
Let me discuss this with our CUDA experts to see what can be done.
I ran across this paper that seems to have tackled this problem. Not sure if their code is available.
@lindstro did this make it into this release (1.0.0)? The release notes do not mention it. If not, is it in the works for the release later this year?
@data-panda No, this release does not include the latest CUDA and HIP work we have been doing. That will end up in the next release. Regarding CUDA streams specifically, that is not something our team has looked at yet. We've had discussions with others who have looked at this (see this paper, for instance) and would welcome a contribution.
@lindstro could you please share current plans regarding CUDA support in zfp? Specifically, I am interested in:
We've yet to do any work on CUDA streams or on lossless compression on the GPU. It is unlikely that either would make it into the next release. The next release will, however, have CUDA and HIP support for fixed-precision and fixed-accuracy modes.
Thanks! Can you share an ETA for the next release?
I've been horrible at predicting release dates in the past and am reluctant to give false hope. That said, we're on the hook to do a release no later than end of September. I expect and hope it will happen well before then.
I have a project where we compute a time step on the GPU and then asynchronously copy some data back to the host for later use. This copy overlaps with the subsequent time step, which saves a ton of time. Now I need to compress the data that we save, which I plan to do on the device before copying it back to the CPU. It would be nice if this compression could also be asynchronous so I could overlap it with other computation.
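For readers following along, here is a rough host-only sketch of the overlap pattern described above: compression of step k runs in the background while step k+1 is computed. It is only an analogy, not zfp or CUDA code. `std::async` stands in for the role a CUDA stream would play, and `compute_step`, `compress_stub` (a toy run-length encoder), and `run_pipeline` are hypothetical names made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Toy compressed representation: (value, run length) pairs.
using Compressed = std::vector<std::pair<float, std::uint32_t>>;

// Hypothetical stand-in for one simulation time step (a GPU kernel in the
// real application). Fills the field with the step index so results are
// easy to check.
std::vector<float> compute_step(int step, std::size_t n) {
    std::vector<float> field(n);
    for (std::size_t i = 0; i < n; ++i)
        field[i] = static_cast<float>(step);
    return field;
}

// Hypothetical stand-in for the compressor: simple run-length encoding.
// This is NOT zfp; it only gives the background task some work to do.
Compressed compress_stub(const std::vector<float>& data) {
    Compressed out;
    for (float v : data) {
        if (!out.empty() && out.back().first == v)
            ++out.back().second;
        else
            out.emplace_back(v, 1u);
    }
    return out;
}

// Runs `steps` time steps. Compression of step k is launched asynchronously
// and collected only after step k+1 has been computed, so the two overlap.
std::vector<Compressed> run_pipeline(int steps, std::size_t n) {
    std::vector<Compressed> saved;
    std::future<Compressed> pending;  // compression currently in flight, if any
    for (int s = 0; s < steps; ++s) {
        // This computation overlaps with the pending compression.
        std::vector<float> field = compute_step(s, n);
        // Collect the previous step's result before launching the next task.
        if (pending.valid())
            saved.push_back(pending.get());
        // The task takes ownership of the buffer, so the next step can
        // allocate and fill a fresh one without a data race.
        pending = std::async(std::launch::async,
                             [buf = std::move(field)] { return compress_stub(buf); });
    }
    if (pending.valid())
        saved.push_back(pending.get());  // drain the last compression
    return saved;
}
```

Calling `run_pipeline(3, 8)` returns three compressed buffers, one per step. In a GPU version the background task would presumably be a compression kernel and a device-to-host copy queued on their own stream, which is the capability being requested here.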