BIDData / BIDMach

CPU and GPU-accelerated Machine Learning Library
BSD 3-Clause "New" or "Revised" License

Pascal GPU and BIDMach #104

Open ghost opened 8 years ago

ghost commented 8 years ago

Hello,

Just wondering if BIDMach will be compatible with the new Pascal NVIDIA GPUs with unified memory? Maybe unified memory would reduce the latency of transferring the data.

Kind regards. Arita

mattmacy commented 8 years ago

The changes to unified memory in CUDA 8 allow one to oversubscribe the GPU memory. Pascal will actually be able to handle page faults, presumably introducing some manner of LRU/MFU page replacement policy. That will make it much easier to work with data sets much larger than the memory on the card, particularly data sets that have some element of random access. Nonetheless, your steady-state working set will need to fit in the card's memory, or your memory bandwidth will converge to the speed of your PCIe link.

My general understanding is that BIDMach handles most applications well enough through minibatch management that it wouldn't really have any direct need for this.

jcanny commented 8 years ago

Unified memory has been in NVIDIA GPUs since CUDA 6. There's actually code in BIDMat to use unified memory for GMats that won't fit in the current GPU's memory, since there's no other way to allocate them. However, performance is really bad. The problem is that the data has to go across the PCI bus. Not only that, but each page has to fault before a copy starts. The bottom line is that we found that using the current unified memory was much slower than a normal CPU-GPU copy of large blocks (5-10 GB/s).
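
To make the mechanism concrete, here is a minimal sketch at the CUDA level (plain CUDA, not the actual BIDMat code; the buffer size and kernel are made up for illustration). A managed allocation is touched from both host and device, and each first touch of a page on the GPU triggers the fault-then-copy behavior described above. Whether an allocation larger than GPU memory succeeds at all depends on the GPU architecture and CUDA version, which is part of what is being debated in this thread.

```cuda
// Sketch only: oversubscribed managed allocation touched from host and device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // first touch of each page on the GPU faults and migrates it
}

int main() {
    const size_t n = 1ull << 31;   // 8 GB of floats; adjust so it exceeds your GPU's memory
    float *x = nullptr;
    cudaError_t err = cudaMallocManaged(&x, n * sizeof(float));
    if (err != cudaSuccess) { printf("alloc failed: %s\n", cudaGetErrorString(err)); return 1; }
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;   // initialize on the host
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // pages migrate to the GPU on demand
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);                  // touching it here migrates pages back
    cudaFree(x);
    return 0;
}
```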

Pascal doesn't introduce unified memory, but it adds the technology to (potentially) make it work well. Namely, it has a fast interconnect (NVLINK) between the CPU, CPU memory, and GPU, if you have the right motherboard. Note that having one of these special motherboards is critical to leveraging unified memory. They should be out soon, given that Pascal just made an appearance at GTC.

It's still not completely clear that unified memory (basically an L4 cache) is the best way to use NVLINK bandwidth for machine learning. If you're training big deep nets, normal main memory is actually fast enough to be swapped in and out of GPU memory through NVLINK, but not through the PCI bus. But in that scenario, you know exactly when and what memory you're going to need, and it's very easy to prefetch it and swap it out. Unified memory's caching scheme might be smart enough to come close to that, but it's very important that it be smart about prefetching contiguous blocks and not just waiting for them to fault.
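
The explicit prefetch scenario would look roughly like the sketch below, using the CUDA 8-era cudaMemPrefetchAsync call on a managed buffer (the chunking scheme and sizes are hypothetical, and it requires a Pascal-class GPU with CUDA 8 or later). The point is to move the next chunk toward the GPU while the current one is being processed, instead of waiting for page faults.

```cuda
// Sketch only: explicit prefetch/evict of a managed buffer in fixed-size chunks,
// for the case where you know exactly when and what memory you will need.
#include <cuda_runtime.h>

void process_in_chunks(float *data, size_t n, size_t chunk, cudaStream_t stream) {
    int dev = 0;
    cudaGetDevice(&dev);
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (off + chunk < n) ? chunk : n - off;
        // move the chunk we are about to use onto the GPU ahead of the kernels
        cudaMemPrefetchAsync(data + off, len * sizeof(float), dev, stream);
        // ... launch kernels on data[off .. off + len) in the same stream ...
        // push the chunk we just finished back to host memory
        if (off >= chunk)
            cudaMemPrefetchAsync(data + off - chunk, chunk * sizeof(float),
                                 cudaCpuDeviceId, stream);
    }
    cudaStreamSynchronize(stream);
}
```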

PS: For very large DNNs, or if you want to train ensembles of DNNs, NVLINKed memory is likely to be a big boost: not for the data coming in, but for the model params and the blocks of tensor data passed between the layers.

mattmacy commented 8 years ago

> Unified memory has been in NVIDIA GPUs since CUDA 6. There's actually code in BIDMat to use unified memory for GMats that won't fit in the current GPU's memory, since there's no other way to allocate them. However, performance is really bad. The problem is that the data has to go across the PCI bus. Not only that, but each page has to fault before a copy starts. The bottom line is that we found that using the current unified memory was much slower than a normal CPU-GPU copy of large blocks (5-10 GB/s).

The bottom line is more or less what I'd expect. However, CUDA 8 does actually greatly extend what one can do with unified memory. Previously it appears that it was primarily geared towards simplifying programming and did not support oversubscription. Unified memory on Pascal allows one to address up to 2 TB, as well as providing a madvise-style API for giving access-type hints on ranges of memory. See "CUDA 8 and Beyond" starting at 3:13: http://on-demand.gputechconf.com/gtc/2016/video/S6224.html
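
For reference, those access-type hints map to the cudaMemAdvise call in CUDA 8. The sketch below is illustrative only; the split between a "model" buffer and a "batch" buffer and the sizes are assumptions for the sake of the example, not BIDMach code.

```cuda
// Sketch only: madvise-style hints on managed allocations via cudaMemAdvise.
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaGetDevice(&dev);

    float *model = nullptr, *batch = nullptr;
    size_t modelBytes = 512ull << 20;   // 512 MB of frequently reused model params
    size_t batchBytes = 4ull << 30;     // 4 GB of training data, read mostly once per pass
    cudaMallocManaged(&model, modelBytes);
    cudaMallocManaged(&batch, batchBytes);

    // keep the frequently reused model resident on the GPU
    cudaMemAdvise(model, modelBytes, cudaMemAdviseSetPreferredLocation, dev);
    // the training data is mostly read-only, so replication is cheap
    cudaMemAdvise(batch, batchBytes, cudaMemAdviseSetReadMostly, dev);
    // tell the driver the GPU will touch the data so mappings can be set up early
    cudaMemAdvise(batch, batchBytes, cudaMemAdviseSetAccessedBy, dev);

    // ... training loop ...

    cudaFree(model);
    cudaFree(batch);
    return 0;
}
```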

Current performance behavior leaves much to be desired, but I think that supporting automatic prefetch for extended sequential accesses could greatly improve performance on workloads with large sequential-access components.

> Pascal doesn't introduce unified memory, but it adds the technology to (potentially) make it work well. Namely, it has a fast interconnect (NVLINK) between the CPU, CPU memory, and GPU, if you have the right motherboard. Note that having one of these special motherboards is critical to leveraging unified memory. They should be out soon, given that Pascal just made an appearance at GTC.

A couple of caveats:

There are planned improvements on the x86 front, but it's anybody's guess how well their respective owners will execute. Intel is 4 years behind on PCI-e gen4 but is promoting Omnipath as their next-gen interconnect. My general feeling is that the only thing Intel is any good at right now is not messing up its microarchitecture (a la Pentium IV and Bulldozer) and continued process shrinks. AMD has its own 100GB/s interconnect that it will use on its Zeppelin APUs - but there is no discussion of when that will become more widely available.

Good Luck.

jcanny commented 8 years ago

No, the older unified memory model does support oversubscription. Try it! You can allocate a GMat that is bigger than the GPU's memory capacity. It will work, but swapping is very slow. I believe they could do a much better job re: the caching strategy, but if it still has to go over a PCIe bus, then realistically it's not going to be that useful.

You're right about the launch dates for the Power-based boards. That's disappointing.

mattmacy commented 8 years ago

> No, the older unified memory model does support oversubscription. Try it! You can allocate a GMat that is bigger than the GPU's memory capacity.

@jcanny I haven't used it myself so I'm only going on how I've parsed their slides. The way they described it at the presentation was that one could only use as much host memory as was on the card - i.e. giving one 2x. You're able to use arbitrarily large allocations?

jcanny commented 8 years ago

Host means the host PC. Memory is allocated on the host CPU and then swapped as needed to the GPU, so you can allocate as much as your PC has. See the docs for "cudaMallocHost". You can treat unified memory pointers like other GPU memory pointers, but accessing them might involve swapping.

mattmacy commented 8 years ago

> Host means the host PC. Memory is allocated on the host CPU and then swapped as needed to the GPU, so you can allocate as much as your PC has. See the docs for "cudaMallocHost". You can treat unified memory pointers like other GPU memory pointers, but accessing them might involve swapping.

Then I'm genuinely confused about the specifics of what's actually different in CUDA 8. I'll have to pester Mark Harris when he posts about it on "Parallel Forall".

Thanks.

mattmacy commented 8 years ago

https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/#more-6554

Perhaps they've just extended the amount of addressable memory and added coherency support over NVLINK. It's not completely clear to me.