hmaarrfk opened 3 years ago
USE_SYSTEM_XNNPACK seems to be the most beneficial.
Do we have an xnnpack feedstock? Do you think it is safe to use, since it is "undocumented" in their setup.py file?
It's documented at https://github.com/pytorch/pytorch/blob/e318058ffe662d426617a405fb21e6470dfc1219/CMakeLists.txt#L369, so I think it's fine to use.
No, we don't have an xnnpack feedstock, but creating one shouldn't be too hard.
Unfortunately, XNNPACK doesn't have any tagged versions, at least none that I can find glancing at Google's GitHub page for it.
It seems like we would be creating version numbers ourselves, which I'm not a fan of.
We can go for date-based version numbering. It's not ideal, but better than building pytorch manually.
Debian has versions like 0.0~git20201221.e1ffe15, so we can go with something like 0.0.2020.12.21.e1ffe15.
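A sketch of building a version string in that 0.0.&lt;date&gt;.&lt;short-hash&gt; shape; `make_version` is a hypothetical helper, not part of any existing tool:

```shell
# Build a date-based version string for an unreleased project.
# make_version is a hypothetical helper for illustration only.
make_version() {
    printf '0.0.%s.%s\n' "$1" "$2"
}

# In a real checkout the inputs would come from git, e.g.:
#   commit_date=$(git log -1 --date=format:%Y.%m.%d --format=%cd)
#   short_hash=$(git rev-parse --short=7 HEAD)
make_version 2020.12.21 e1ffe15   # prints 0.0.2020.12.21.e1ffe15
```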
Would we create dedicated versions for the ones pytorch pins to?
Yes, we should create the versions needed by pytorch. For example, pytorch 1.9.0 uses 55d53a4e7079d38e90acd75dd9e4f9e781d2da35, so 0.0.0.2021.02.23.55d53a4, and have run_exports that pin exactly.
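An exact run_exports pin could look like this hypothetical meta.yaml fragment (the package name, version, and layout are assumptions, not an existing recipe):

```yaml
package:
  name: xnnpack
  version: 0.0.0.2021.02.23.55d53a4

build:
  run_exports:
    # pin exactly: there is no upstream versioning or ABI guarantee
    - "{{ pin_subpackage('xnnpack', exact=True) }}"
```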
Nvm, xnnpack will save only a few minutes.
Ok thanks.
NCCL seems to take 20-30 minutes to build. That might just bring us under the 6 hours. Let's see.
The other option I want to try first is to use mamba, but only after the NCCL results.
As expected, XNNPACK is quick to compile https://github.com/conda-forge/staged-recipes/pull/15865
@hmaarrfk @isuruf @IvanYashchuk, I'm going to help with reducing the build time of PyTorch on the conda-forge CIs. Could you please let me know which strategies have already been tried, even unsuccessfully?
Honestly, my next step was going to be separating out the common CPU code and making that into a library. However, if we can't get the GPU builds to finish (and those take the most time, since they target a wide class of GPUs), it seems a little moot.
I'm not too excited to depart from upstream build procedures.
So the idea would be to strip ATen and these sorts of core libraries apart?
Splitting off libraries is not going to help much with the time taken to build the GPU packages, because compiling the GPU code takes a long time due to the number of GPU architectures we compile for.
Current Azure Pipelines jobs are limited to 6 hours. Because of that, the packages are currently built and uploaded manually, which is not ideal. There were attempts to compile less code, for example switching off the Caffe2 builds (https://github.com/conda-forge/pytorch-cpu-feedstock/pull/64) or pruning old CUDA architectures (https://github.com/conda-forge/pytorch-cpu-feedstock/pull/47/commits/eb19415cfd620fa500bb2256176264be77f16efa).
I think the current best way to achieve semi-automated uploads on Azure Pipelines is to use ccache or sccache and save the cache directory using Azure's Cache@2 task. I'm testing it here. Two caveats there:
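As a rough sketch of what that could look like in a build script (the paths and the 5G cache size are illustrative assumptions, not the feedstock's actual settings):

```shell
# Point ccache at a directory that the Cache@2 task saves and restores
# between pipeline runs, then tell CMake to launch every compiler
# invocation through ccache.
export CCACHE_DIR="${CCACHE_DIR:-$HOME/.ccache}"

launcher_flags="-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"

if command -v ccache >/dev/null 2>&1; then
    ccache --max-size=5G    # keep the cache within CI storage limits
fi

# cmake $launcher_flags .. && cmake --build .   # (not run here)
echo "$launcher_flags"
```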
Looking at the last commit from master, 6 hours in the Linux CUDA builds gets us to around [4381/4911], or 89% of files compiled. Two consecutive runs (the second one needs to be manually triggered) would most likely finish the builds.
I've found that the last few hundred files are the ones that take the longest to compile.
On a powerful machine, Threadripper 2 + 128 GB RAM, here are a few timings:
```
mark@ostrich $ ls build_artifacts/*build* -lahtcr
-rw-r--r-- 1 mark mark 0 Oct 23 10:21 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.0cudnn8numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 12:09 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version10.2cudnn7numpy1.19python3.9.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 14:50 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.2cudnn8numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 17:10 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.1cudnn8numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 19:27 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.1cudnn8numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 21:14 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version10.2cudnn7numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 23 23:16 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.0cudnn8numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 01:55 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.2cudnn8numpy1.18python3.8.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 03:43 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version10.2cudnn7numpy1.18python3.7.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 05:44 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.0cudnn8numpy1.19python3.9.____cpython
-rw-r--r-- 1 mark mark 0 Oct 24 08:23 build_artifacts/conda-forge-build-done-linux_64_cuda_compiler_version11.2cudnn8numpy1.19python3.9.____cpython
```
You can see that the CUDA 11.2 builds take about 2 h 40 min each, while the other builds take about 2 hours.
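Since each build-done marker's timestamp is the completion time of one variant, consecutive timestamps give per-build durations. For instance, diffing the two timestamps bracketing the final (CUDA 11.2) build above, assuming GNU date and a 2021 date (the year isn't shown in the listing):

```shell
# Duration of the last CUDA 11.2 build, from the two surrounding
# build-done timestamps (year 2021 is an assumption).
start=$(date -u -d '2021-10-24 05:44' +%s)
end=$(date -u -d '2021-10-24 08:23' +%s)
echo "$(( (end - start) / 60 )) minutes"   # prints: 159 minutes (~2 h 40 min)
```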
Regarding caching, is the total job limit 6 hours, or is each step limited to 6 hours?
> Regarding caching, is the total job limit 6 hours, or is each step limited to 6 hours?
6 hours is the total job limit.
I get that this comes at a cost; I just wanted to list these out in case they can help us get below the 6-hour build time:
I found these variables in cmake/Dependencies.cmake:
```
USE_SYSTEM_PTHREADPOOL
USE_SYSTEM_CPUINFO
USE_SYSTEM_XNNPACK
USE_SYSTEM_FP16
USE_SYSTEM_EIGEN_INSTALL
USE_SYSTEM_PYBIND11
USE_SYSTEM_NCCL
USE_SYSTEM_GLOO
USE_SYSTEM_ONNX
```
For some of them (e.g. NCCL) it might be useful to decouple the dependency; for others, maybe not.
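A sketch of how a few of these switches might be passed to CMake; the particular selection here is illustrative, not a tested configuration, and each one only makes sense once a matching, exactly pinned package exists:

```shell
# Opt in to system-provided copies of some bundled dependencies.
# Which switches are safe to flip depends on what feedstocks are
# available; this selection is an assumption for illustration.
system_dep_flags="-DUSE_SYSTEM_NCCL=ON \
-DUSE_SYSTEM_XNNPACK=ON \
-DUSE_SYSTEM_PTHREADPOOL=ON \
-DUSE_SYSTEM_CPUINFO=ON"

# cmake $system_dep_flags ..   # (not run here)
echo "$system_dep_flags"
```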