arch4edu / arch4edu

Arch Linux Repository for Education
https://arch4edu.org

Package Request: python-pytorch-all #222

Open · iamhumanipromise opened this issue 1 year ago

iamhumanipromise commented 1 year ago

Case for package:

1. Many Haswell+ laptops have a dedicated NVIDIA GPU in addition to their Intel Gen8+ iGPU. These laptops also have Thunderbolt, and to save money many users encountered at universities (and one client in California) pair them with a Radeon eGPU. (This also makes it "easier" to use the same eGPU on a Mac for dev work.)

2. This multi-architecture approach would also be a "laptop-sized" example of a mixed-computing environment: the same mix of architectures found in an on-prem data center or a mixed cloud environment. It is a microcosm of real-world, high-performance computing usage that students otherwise do not have access to until graduate programs. (Even then, only for a limited time.)

Package Description:

Tensors and Dynamic neural networks in Python with strong GPU acceleration (with TensorRT, CUDA, ROCm, OneAPI-DNN (MKL-DNN), ZenDNN and AVX2 CPU optimizations)

With this package, most CPUs and GPUs made since 2012/2013 would be able to support student and researcher AI usage.

Real World Example Machine #1

Real World Example Machine #2

Each one of these could benefit from a bundled package such as the one mentioned.

Machine 1 = 5,624 cores: 28 VPU cores, 6 CPU cores, 368 Tensor cores, 46 RT cores, and the remainder GPU/CUDA cores.

Machine 2 = 9,440 cores: 8 CPU cores, 32 RT cores, 512 XMX cores, and the rest GPU cores.

(Note: A Movidius VPU SHAVE core is a 128-bit VLIW vector processor that can perform parallel computations on image and video data.)

petronny commented 1 year ago

> Tensors and Dynamic neural networks in Python with strong GPU acceleration (with TensorRT, CUDA, ROCm, OneAPI-DNN (MKL-DNN), ZenDNN and AVX2 CPU optimizations)

Great idea, but I'm not sure it's possible to compile PyTorch with both CUDA and ROCm enabled.

  • With CUDA 9 or 11: all NVIDIA sm_2x and sm_3x cards (Fermi, Kepler). Notable cards include the 6GB Quadro 6000, Quadro 7000, and 2x 6GB Quadro Plex 7000; the 2x 12GB Tesla K80, 12GB Tesla K40, and the 6GB cards below them; also the 6GB Titan and Titan Black and the 2x 6GB Titan Z.
  • With CUDA 11.6: all sm_4x cards (Maxwell: GTX 980, etc.).
  • With CUDA 12: all sm_5x cards and above.
  • ROCm would support all Polaris GFX804+ GCN4 cards, such as the Sapphire 16GB RX 570 or the 2x 16GB Radeon Pro Duo, plus Vega and above.

Old GPUs may only work with their corresponding old CUDA/ROCm libraries. And even if these different CUDA/ROCm libraries co-exist on the system, I don't think the compiler can use all of them: where the code has #include <cuda.h>, the compiler will only use the first cuda.h it finds. Besides, old GPUs may not be officially supported by the latest PyTorch either.
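For what it's worth, you can check from Python which backend a given build carries. A minimal sketch (attribute names as in recent PyTorch releases; a single build exposes at most one of CUDA or ROCm):

import torch

# A single PyTorch build is compiled against at most one accelerator
# backend; these version attributes reveal which one.
cuda_ver = torch.version.cuda                   # e.g. "12.1", or None
hip_ver = getattr(torch.version, "hip", None)   # set only in ROCm builds

if hip_ver is not None:
    print("ROCm/HIP build:", hip_ver)
elif cuda_ver is not None:
    print("CUDA build:", cuda_ver)
else:
    print("CPU-only build")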

And I don't think having different versions of the same libraries co-exist on the system fits the philosophy of Arch Linux. It sounds more like Gentoo.

Back to your situations: I think they are ideal cases for virtual environments, Anaconda, or even Docker. Just create an environment for each kind of computing resource on your machine, install the right version of PyTorch in each, and then somehow provide a unified interface to access them. For example, the users can run their code like:

environment=cuda # Or rocm, cuda8, cpu_only or any environment
source /path/to/$environment/bin/activate
python train.py
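The same device-agnostic train.py can then run unchanged in every environment, since ROCm builds of PyTorch expose HIP devices through the same torch.cuda API. A minimal sketch:

import torch

# Picks the GPU in both the cuda and rocm environments above,
# and falls back to CPU in the cpu_only environment.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)
x = torch.randn(4, 8, device=device)
print(model(x).shape, "on", device)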
iamhumanipromise commented 1 year ago

Thank you for the advice regarding how to proceed!

I’m also thinking of using SHARK instead, to use the iGPU, dGPU, and CPU together rather than containerized apart.

The SHARK approach may be a little more difficult given the lack of documentation (but it should be a cleaner solution! Will have to request a package for that if it works!! ;-)

iamhumanipromise commented 1 year ago

This being said: is it possible to have a package for PyTorch and TensorFlow that is “opt-ROCm” or “opt-CUDA”, where the “opt” part includes Intel CPUs & GPUs & Myriad VPUs?

Aka the behavior OpenVINO has today (yet another one without a package!)
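For reference, a rough sketch of the OpenVINO behavior being described, using its Python runtime API (the openvino.runtime API of the OpenVINO 2022 era; the model path here is a placeholder):

from openvino.runtime import Core

core = Core()
# Device plugins present on the machine, e.g. ['CPU', 'GPU', 'MYRIAD']
print(core.available_devices)

# "AUTO" lets the runtime pick among whichever plugins are available.
model = core.read_model("model.xml")   # placeholder IR model path
compiled = core.compile_model(model, "AUTO")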

iamhumanipromise commented 1 year ago

Summary so far: these are two HSA machines which need local virtual environments and a new backend to schedule/coordinate/orchestrate/facilitate the distribution of neural workloads. (Or another clever solution)

I have tried various existing techniques to expose virtualized environments to each other, to no avail, and am again limited by the lack of a "neural fabric" interface that would let multiple virtual environments distribute and coordinate the processing of jobs simultaneously. I also do not yet have the Python skills to create this.
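Purely as an illustration of the missing piece: a hypothetical launcher could fan the same script out to each per-backend environment and collect results. The interpreter paths and the --shard flag below are assumptions, not an existing tool:

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-backend virtual environments (paths are assumptions).
ENVS = {
    "cuda": "/path/to/cuda/bin/python",
    "rocm": "/path/to/rocm/bin/python",
    "cpu":  "/path/to/cpu_only/bin/python",
}

def run(name, python):
    # Each environment's interpreter brings its own PyTorch build.
    result = subprocess.run([python, "train.py", "--shard", name])
    return name, result.returncode

with ThreadPoolExecutor() as pool:
    for name, code in pool.map(lambda item: run(*item), ENVS.items()):
        print(name, "exited with", code)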

I have opened an issue on the Easy Diffusion project's GitHub.

I have enrolled in an MIT 12-week crash course in AI, but I believe all that will do is give me a better overview before diving deeper. It will be a while before I am able to develop the strategy, though the dev environment is lying in wait!