hmaarrfk opened 3 years ago
USE_SYSTEM_XNNPACK seems to be the most beneficial.
Do we have an xnnpack feedstock? Do you think it is safe to use, since it is "undocumented" in their setup.py file?
It's documented at https://github.com/pytorch/pytorch/blob/e318058ffe662d426617a405fb21e6470dfc1219/CMakeLists.txt#L369, so I think it's fine to use.
No, we don't have an xnnpack feedstock, but creating one shouldn't be too hard.
Unfortunately, XNNPACK doesn't have any tagged versions, at least none that I can find glancing at Google's GitHub page for it.
It seems like we would be creating version numbers ourselves, which I'm not a fan of.
We can go for date-based version numbering. It's not ideal, but better than building pytorch manually.
Debian has versions like 0.0~git20201221.e1ffe15, so we can go with something like 0.0.2020.12.21.e1ffe15.
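A sketch of building a version string in that 0.0.&lt;date&gt;.&lt;short-hash&gt; shape; `make_version` is a hypothetical helper, not part of any existing tool:

```shell
# Build a date-based version string for an unreleased project.
# make_version is a hypothetical helper for illustration only.
make_version() {
    printf '0.0.%s.%s\n' "$1" "$2"
}

# In a real checkout the inputs would come from git, e.g.:
#   commit_date=$(git log -1 --date=format:%Y.%m.%d --format=%cd)
#   short_hash=$(git rev-parse --short=7 HEAD)
make_version 2020.12.21 e1ffe15   # prints 0.0.2020.12.21.e1ffe15
```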
Would we create dedicated versions for the ones pytorch pins to?
Yes, we should create the versions needed by pytorch. For example, pytorch 1.9.0 uses 55d53a4e7079d38e90acd75dd9e4f9e781d2da35, so 0.0.0.2021.02.23.55d53a4, and have run_exports that pin exactly.
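An exact run_exports pin could look like this hypothetical meta.yaml fragment (the package name, version, and layout are assumptions, not an existing recipe):

```yaml
package:
  name: xnnpack
  version: 0.0.0.2021.02.23.55d53a4

build:
  run_exports:
    # pin exactly: there is no upstream versioning or ABI guarantee
    - "{{ pin_subpackage('xnnpack', exact=True) }}"
```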
Nvm, xnnpack will save only a few minutes.
Ok thanks.
NCCL seems to take 20-30 minutes to build. That might just bring us under the 6 hours. Let's see.
The other option I want to try first is to use mamba, but only after the NCCL results.
As expected, XNNPACK is quick to compile https://github.com/conda-forge/staged-recipes/pull/15865
@hmaarrfk @isuruf @IvanYashchuk, I'm going to help with reducing the build time of PyTorch on the conda-forge CIs. Could you please let me know which strategies have already been tried, even unsuccessfully?
Honestly, my next step was going to be separating out the common CPU code and making that into a library. However, if we can't get the GPU builds to finish (and those take the most time, since they target a wide class of GPUs), it seems a little moot.
I'm not too excited to depart from upstream build procedures.
So the idea would be to strip ATen and these sorts of core libraries apart?
Splitting off libraries is not going to help much with the time taken to build the GPU packages, because compiling the GPU code takes a long time due to the number of GPU architectures we compile for.
Current Azure Pipelines jobs are limited to 6 hours. Because of that, the packages are currently built and uploaded manually, which is not ideal. There were attempts to compile less code, for example switching off the Caffe2 builds (https://github.com/conda-forge/pytorch-cpu-feedstock/pull/64) or pruning old CUDA architectures (https://github.com/conda-forge/pytorch-cpu-feedstock/pull/47/commits/eb19415cfd620fa500bb2256176264be77f16efa).
I think the current best way to achieve semi-automated uploads on Azure Pipelines is to use ccache or sccache and save the cache directory using Azure's Cache@2 task. I'm testing it here. Two caveats there:
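As a rough sketch of what that could look like in a build script (the paths and the 5G cache size are illustrative assumptions, not the feedstock's actual settings):

```shell
# Point ccache at a directory that the Cache@2 task saves and restores
# between pipeline runs, then tell CMake to launch every compiler
# invocation through ccache.
export CCACHE_DIR="${CCACHE_DIR:-$HOME/.ccache}"

launcher_flags="-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"

if command -v ccache >/dev/null 2>&1; then
    ccache --max-size=5G    # keep the cache within CI storage limits
fi

# cmake $launcher_flags .. && cmake --build .   # (not run here)
echo "$launcher_flags"
```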
Looking at the last commit from master, 6 hours in the Linux CUDA builds gets us to around [4381/4911], or 89% of files compiled. Two consecutive runs (the second one needs to be manually triggered) would most likely finish the builds.
I've found that the last few hundred files are the ones that take the longest to compile.
On a powerful machine, Threadripper 2 + 128 GB RAM, here are a few timings:
```
mark@ostrich $ ls build_artifacts/*build* -lahtcr
-rw-r--r-- 1 mark mark 0 Oct 23 10:21 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.0cudnn8numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 12:09 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version10.2cudnn7numpy1.19python3.9.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 14:50 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.2cudnn8numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 17:10 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.1cudnn8numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 19:27 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.1cudnn8numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 21:14 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version10.2cudnn7numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 23:16 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.0cudnn8numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 01:55 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.2cudnn8numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 03:43 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version10.2cudnn7numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 05:44 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.0cudnn8numpy1.19python3.9.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 08:23 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.2cudnn8numpy1.19python3.9.____cpython
```
You can see that the CUDA 11.2 builds take about 2 h 40 min each, while the other builds take about 2 hours.
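Since each build-done marker's timestamp is the completion time of one variant, consecutive timestamps give per-build durations. For instance, diffing the two timestamps bracketing the final (CUDA 11.2) build above, assuming GNU date and a 2021 date (the year isn't shown in the listing):

```shell
# Duration of the last CUDA 11.2 build, from the two surrounding
# build-done timestamps (year 2021 is an assumption).
start=$(date -u -d '2021-10-24 05:44' +%s)
end=$(date -u -d '2021-10-24 08:23' +%s)
echo "$(( (end - start) / 60 )) minutes"   # prints: 159 minutes (~2 h 40 min)
```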
Regarding caching, is the total job limit 6 hours, or is each step limited to 6 hours?
> Regarding caching, is the total job limit 6 hours, or is each step limited to 6 hours?
6 hours is the total job limit.
I get that this comes at a cost; I just wanted to list these out in case they can help us get below the 6-hour build time:
I found these variables in cmake/Dependencies.cmake:
```
USE_SYSTEM_PTHREADPOOL
USE_SYSTEM_CPUINFO
USE_SYSTEM_XNNPACK
USE_SYSTEM_FP16
USE_SYSTEM_EIGEN_INSTALL
USE_SYSTEM_PYBIND11
USE_SYSTEM_NCCL
USE_SYSTEM_GLOO
USE_SYSTEM_ONNX
```
For some of them (e.g. NCCL) it might be useful to decouple the dependency; for others, maybe not.
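A sketch of how a few of these switches might be passed to CMake; the particular selection here is illustrative, not a tested configuration, and each one only makes sense once a matching, exactly pinned package exists:

```shell
# Opt in to system-provided copies of some bundled dependencies.
# Which switches are safe to flip depends on what feedstocks are
# available; this selection is an assumption for illustration.
system_dep_flags="-DUSE_SYSTEM_NCCL=ON \
-DUSE_SYSTEM_XNNPACK=ON \
-DUSE_SYSTEM_PTHREADPOOL=ON \
-DUSE_SYSTEM_CPUINFO=ON"

# cmake $system_dep_flags ..   # (not run here)
echo "$system_dep_flags"
```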