Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Apex vs Pytorch #105

Closed · StrangeTcy closed this issue 9 months ago

StrangeTcy commented 10 months ago

Your requirements.txt leads to an installation of PyTorch built against CUDA 11.7, and the install guide (https://llama2-accessory.readthedocs.io/en/latest/install.html) also recommends building apex from source. However, apex checks the bare-metal CUDA version and refuses to build anything if it detects a mismatch (in my case, system-wide CUDA 12.3 vs. conda PyTorch with CUDA 11.7). Is that really wise?

ChrisLiu6 commented 10 months ago

Hi! I will explain the reasoning behind these design choices below. However, as I am not an expert on these problems, please feel free to point out anything I've got wrong.

Why do we specify torch+cu117?

Take your case as an example: the CUDA Toolkit on your machine is version 12.3, but according to this page, PyTorch has not released a package pre-built against that CUDA version; for PyTorch 2.0.1, the highest supported CUDA version is 11.8. Therefore, even if we did not pin the torch CUDA version, you would still run into the same problem.

So, if you want to successfully build apex from source, you have to install a CUDA Toolkit version that is also supported by PyTorch. In our experience, CUDA 11.7 turns out to be a feasible choice for torch 2.0.1. Note that installing a different version of the CUDA Toolkit is actually not very troublesome (https://developer.nvidia.com/cuda-toolkit-archive), and multiple versions of the CUDA Toolkit can co-exist on the same machine.
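
To make the mismatch concrete, here is a minimal, hypothetical Python sketch (not part of this repo or of apex itself) of the kind of check apex performs at build time. Since `torch.utils.cpp_extension.CUDA_HOME` honors the `CUDA_HOME` environment variable, you can point it at whichever co-installed toolkit matches the torch build before building apex:

```python
# Minimal sketch (not part of LLaMA2-Accessory) of the check apex's setup.py
# effectively performs before building its CUDA extensions: the CUDA version
# torch was built with must match the bare-metal CUDA Toolkit it finds.
import os
import re
import subprocess

import torch
from torch.utils.cpp_extension import CUDA_HOME  # resolved from the CUDA_HOME env var

torch_cuda = torch.version.cuda  # e.g. "11.7" for torch==2.0.1+cu117
assert CUDA_HOME is not None, "no CUDA Toolkit found"

# Ask the bare-metal toolkit (the one apex would compile against) for its version.
nvcc_out = subprocess.run(
    [os.path.join(CUDA_HOME, "bin", "nvcc"), "--version"],
    capture_output=True, text=True,
).stdout
bare_metal_cuda = re.search(r"release (\d+\.\d+)", nvcc_out).group(1)  # e.g. "12.3"

print(f"torch built with CUDA {torch_cuda}; nvcc under {CUDA_HOME} is CUDA {bare_metal_cuda}")
if torch_cuda != bare_metal_cuda:
    print("Mismatch: point CUDA_HOME at a co-installed toolkit that matches the "
          "torch build (e.g. /usr/local/cuda-11.7) before building apex.")
```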

linziyi96 commented 10 months ago

We also have a fallback in case apex cannot be imported, so it should be okay if you choose not to install apex: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/25655b40e565d543fd7ffbbb7bc1388b7f59d432/accessory/model/components.py#L7-L53

In our experience, the primary benefits of apex's FusedRMSNorm are: (1) saving GPU memory when NOT using gradient checkpointing, and (2) speeding up training when the model parallel size is large (since the RMSNorm part is replicated across all model parallel workers). So if you mostly work with the smaller models (e.g., 7B or 13B) and enable gradient checkpointing (i.e., add --checkpointing to the training script), I would not expect a huge performance degradation without apex.
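
For readers who don't want to open the link, the fallback follows the usual try/except import pattern; the following is a simplified sketch of that pattern (see components.py above for the exact code the repo uses):

```python
# Simplified sketch of the apex fallback pattern used in
# accessory/model/components.py (see the linked lines for the actual code).
import warnings

import torch
import torch.nn as nn

try:
    # Fused CUDA kernel from apex, used when the package is available.
    from apex.normalization import FusedRMSNorm as RMSNorm
except ImportError:
    warnings.warn("Cannot import apex FusedRMSNorm, falling back to a plain PyTorch RMSNorm")

    class RMSNorm(nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def _norm(self, x: torch.Tensor) -> torch.Tensor:
            return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Normalize in fp32 for numerical stability, then cast back.
            return self._norm(x.float()).type_as(x) * self.weight
```

Both paths should compute the same RMSNorm; apex mainly fuses the kernel, which is where the memory and speed benefits described above come from.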