NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Failed the first time, succeeded the next time? #1856

Open zhangs-a-n opened 2 weeks ago

zhangs-a-n commented 2 weeks ago

I ran the following command:

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

The first run failed with the following error:

 RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error

  × Building wheel for apex (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/zsf/anaconda3/envs/pyt231py312_2_linux/bin/python3.12 /home/zsf/anaconda3/envs/pyt231py312_2_linux/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmp2ijgo8p4
  cwd: /home/zsf/anaconda3/envs/pyt231py312_2_linux/apex
  Building wheel for apex (pyproject.toml) ... error
  ERROR: Failed building wheel for apex
Failed to build apex
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (apex)

This was actually progress. The assorted pip install commands I had tried earlier failed in other ways: some raised TypeError: str, and others raised No module named torch (even though PyTorch was clearly installed).

On the second attempt, I dropped the -v option because the output was too verbose. After waiting several minutes (about 10?), the install reported success, which is baffling.

Processing /home/zsf/anaconda3/envs/pyt231py312_2_linux/apex
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: packaging>20.6 in /home/zsf/anaconda3/envs/pyt231py312_2_linux/lib/python3.12/site-packages (from apex==0.1) (24.1)
Building wheels for collected packages: apex
  Building wheel for apex (pyproject.toml) ... done
  Created wheel for apex: filename=apex-0.1-cp312-cp312-linux_x86_64.whl size=4844829 sha256=5256a4aa59e969e609ca1ba25f616b68607eac921bde36fbff1c063a4515a570
  Stored in directory: /tmp/pip-ephem-wheel-cache-milgfajo/wheels/45/ef/09/6cfbe9deb98dfb0c3024c7fb91f389935bccbff826387be8f2
Successfully built apex
Installing collected packages: apex
Successfully installed apex-0.1
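
The final `RuntimeError: Error compiling objects for extension` is only a summary; the underlying compiler message usually appears earlier in the verbose output. A minimal sketch, assuming the `-v` output was saved to a file, that pulls out the first lines that look like real errors. The file name `build.log`, the helper `first_error_lines`, and the pattern list are hypothetical choices for illustration, not part of apex or pip:

```python
import re
from pathlib import Path

# Patterns that commonly mark a failure in compiler/ninja output.
# This list is an assumption for illustration, not an exhaustive set.
ERROR_PATTERN = re.compile(r"(\berror:|\bFAILED:|fatal error|Killed)", re.IGNORECASE)


def first_error_lines(log_text: str, limit: int = 5) -> list[str]:
    """Return up to `limit` lines from the log that look like errors."""
    hits = []
    for line in log_text.splitlines():
        if ERROR_PATTERN.search(line):
            hits.append(line.strip())
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__":
    log_path = Path("build.log")  # hypothetical file name
    if log_path.exists():
        for line in first_error_lines(log_path.read_text(errors="replace")):
            print(line)
```

The log can be captured by appending `2>&1 | tee build.log` to the verbose pip command. Comparing the first reported error between a failing and a succeeding run may help show whether the failure was transient (for example, a compiler process killed for lack of memory) or a genuine source error.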

zhangs-a-n commented 2 weeks ago

I installed apex in a conda virtual environment. The command I used was:

pip install --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

The virtual environment uses PyTorch 2.3.1 with CUDA version 12.1. The system is Ubuntu 22.04 LTS.

When installing apex with --config-settings "--build-option=--cpp_ext" and --config-settings "--build-option=--cuda_ext", you need gcc and a CUDA Toolkit whose version matches the CUDA version of the virtual environment. The CUDA Toolkit is installed on the system, not inside the virtual environment.

To install the CUDA Toolkit, see https://developer.nvidia.com/cuda-toolkit-archive. Be sure to install the version that matches the CUDA version in the virtual environment.
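
One way to check the match before building is to compare the release reported by `nvcc --version` with `torch.version.cuda`. The following is a minimal sketch under stated assumptions: the helper names `parse_nvcc_release` and `versions_match` are hypothetical, and it inspects whichever `nvcc` is first on `PATH`, which may differ from the one under `CUDA_HOME` that the extension build actually uses.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output: str):
    """Extract 'major.minor' from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def versions_match(toolkit_version, torch_cuda_version) -> bool:
    """Compare only major.minor, e.g. '12.1' against '12.1'."""
    if not toolkit_version or not torch_cuda_version:
        return False
    return toolkit_version.split(".")[:2] == torch_cuda_version.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None

    nvcc = shutil.which("nvcc")  # assumption: nvcc is on PATH
    toolkit = None
    if nvcc:
        out = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
        toolkit = parse_nvcc_release(out)

    print(f"PyTorch built with CUDA: {torch_cuda}")
    print(f"System nvcc release:     {toolkit}")
    print("match" if versions_match(toolkit, torch_cuda) else "mismatch or not found")
```

Run it with the conda environment activated so that the `torch` being inspected is the one apex will be built against.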

The following error is reported when the installed CUDA Toolkit version does not match the CUDA version in the virtual environment:

- [RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.3.
      In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).