conda-forge / pytorch-cpu-feedstock

A conda-smithy repository for pytorch-cpu.
BSD 3-Clause "New" or "Revised" License

CUDA Build tracker #52

Closed hmaarrfk closed 3 years ago

hmaarrfk commented 3 years ago

We will be starting a cuda build run after https://github.com/conda-forge/pytorch-cpu-feedstock/pull/44

is merged. This table should help track the builds.

CUDA Build Tracker

MKL 2021: All 16 builds have been uploaded to conda-forge.

| # | Configuration | MKL 2020 (16/16), `==1.9.0-*_0` channel (label: `forge`) | MKL 2021 (16/16), `==1.9.0-*_1` channel |
|---|---------------|------------------------------------------------------------|------------------------------------------|
| 1 | CUDA 10.2, Python 3.6 | ramonaoptics | conda-forge |
| 2 | CUDA 10.2, Python 3.7 | ramonaoptics | conda-forge |
| 3 | CUDA 10.2, Python 3.8 | ramonaoptics | conda-forge |
| 4 | CUDA 10.2, Python 3.9 | ramonaoptics | conda-forge |
| 5 | CUDA 11.0, Python 3.6 | ramonaoptics | conda-forge |
| 6 | CUDA 11.0, Python 3.7 | ramonaoptics | conda-forge |
| 7 | CUDA 11.0, Python 3.8 | ramonaoptics | conda-forge |
| 8 | CUDA 11.0, Python 3.9 | ramonaoptics | conda-forge |
| 9 | CUDA 11.1, Python 3.6 | ramonaoptics | conda-forge |
| 10 | CUDA 11.1, Python 3.7 | ramonaoptics | conda-forge |
| 11 | CUDA 11.1, Python 3.8 | ramonaoptics | conda-forge |
| 12 | CUDA 11.1, Python 3.9 | ramonaoptics | conda-forge |
| 13 | CUDA 11.2, Python 3.6 | ramonaoptics | conda-forge |
| 14 | CUDA 11.2, Python 3.7 | ramonaoptics | conda-forge |
| 15 | CUDA 11.2, Python 3.8 | ramonaoptics | conda-forge |
| 16 | CUDA 11.2, Python 3.9 | ramonaoptics | conda-forge |


IvanYashchuk commented 3 years ago

Mark, could you briefly explain the process? I was under the impression that manually built packages are not allowed to be uploaded to conda-forge. People would build the listed CUDA configurations and upload them to some storage, and then the feedstock maintainers would upload them manually to the conda-forge channel, right?

hmaarrfk commented 3 years ago

Upload them to your own public Anaconda channel.

I kind of want to merge the MKL migration first.

rgommers commented 3 years ago

This approach seems a little fragile, and more work than needed, especially in the long term. There are 16 builds here, and one build takes 15-25 minutes on a decent build machine, so all binaries can be built in 4-6 hours. How about writing a reproducible and well-documented build script and letting a single person with access to a good build server build everything at once?

hmaarrfk commented 3 years ago
#!/usr/bin/env bash

set -ex
conda activate base
# Ensure that the anaconda command exists for uploading
which anaconda

docker system prune --force
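# Collect the CUDA build configs; the [^nN] in the glob skips the
# cuda_compiler_versionNone (CPU-only) variants in .ci_support/.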
configs=$(find .ci_support/ -type f -name '*cuda_compiler_version[^nN]*' -printf "%p ")
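# Upload anything already built by a previous run before starting;
# --skip (anaconda-client's skip-existing behaviour) avoids errors for packages already on the channel.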
anaconda upload  --skip build_artifacts/linux-64/pytorch*

# Assuming a powerful enough machine with many cores;
# 10 jobs seems to be a good point where builds don't run out of RAM too often.
export CPU_COUNT=10

for config_filename in $configs; do
    filename=$(basename ${config_filename})
    config=${filename%.*}
    if [ -f build_artifacts/conda-forge-build-done-${config} ]; then
        echo skipped $config
        continue
    fi

    python build-locally.py $config
    # Docker images get quite big; clean them up after each build to save disk space.
    docker system prune --force
    anaconda upload  --skip build_artifacts/linux-64/pytorch*
done
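
(For context on resuming: the conda-forge build scripts drop a build_artifacts/conda-forge-build-done-<config> marker when a configuration finishes, so re-running this loop skips completed configs, and the anaconda upload after each iteration pushes whatever has been built so far.)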

15-25 mins.... what kind of machine do you have access to?

hmaarrfk commented 3 years ago

^^^^ It's kind of a serious question; I'm genuinely interested in knowing.

rgommers commented 3 years ago

15-25 mins.... what kind of machine do you have access to?

Desktop 12-core / 32 GB. And we have a few 32-core / 128 GB dev servers.

rgommers commented 3 years ago

If that script is all there is to it, that's quite nice. Also let's make sure it doesn't blow up disk space completely. Are they all separate Docker images, and how much space does one take?

hmaarrfk commented 3 years ago

The Docker images do take space.

My build machine's root storage is full and Docker is complaining about disk space for me.

hmaarrfk commented 3 years ago

@rgommers I'm not really sure what happened, but my build time was closer to 8 hours on an AMD Ryzen 7 3700X 8-core processor. Maybe my processor was oversubscribed, but I started a build just now and checked that nobody else had been using the server for at least an hour. I can report if the build finishes in under an hour, but I somewhat doubt it.

rgommers commented 3 years ago

That seems really long. There is a lot of stuff to turn on and off, so I probably cheated here by turning a few of the expensive things off - in particular using USE_DISTRIBUTED=0. That said, here is an impression of the PyTorch CI build stages on CircleCI:

[Screenshot: PyTorch CI build stage timings on CircleCI]

The >1 hr one is the mobile build. Regular Linux builds are in the 20-50 min range. 8 hours doesn't sound right, something must be misconfigured for you.

hmaarrfk commented 3 years ago

Thanks for the info. I think I'm not using all CPUs; I can see that only 2 are being used. I probably need to pass another environment variable through. I'll have to see how I can pass it through.

rgommers commented 3 years ago

MAX_JOBS is the env var that controls how many cores the pytorch build will use.

I have started one build with build-locally.py to time how long it actually takes for me.

hmaarrfk commented 3 years ago

MAX_JOBS is set from CPU_COUNT, it seems; that is the conda-forge variable that sets the number of processors for CI builds.

hmaarrfk commented 3 years ago

I'm running

CPU_COUNT=$(nproc) time python build-locally.py

to see how much it helps. I think it should help a lot. Thanks for helping debug.

I'm updating the suggested script too.

rgommers commented 3 years ago

$(nproc)

That's still not quite right for me. I get:

$ nproc
2
$ nproc --all
24

The optimal number is the number of physical cores I think. 24 will be slower than 12.
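
If it helps, a small sketch for counting physical cores rather than hardware threads (assuming lscpu from util-linux is available):

```
# Count unique (core, socket) pairs reported by lscpu = number of physical cores.
lscpu --parse=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```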

rgommers commented 3 years ago

It takes about 15 minutes before the build actually starts - downloading + solving the build env + cloning the repo is very slow.

And probably at the end it'll take another 10 minutes, IIRC another conda solve is needed to set up the test env. And no tests are run other than import torch, so leaving this out of the recipe could help.
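
For reference, the test that does run amounts to little more than an import check, roughly (a sketch, not the literal recipe command):

```
python -c "import torch; print(torch.__version__)"
```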

hmaarrfk commented 3 years ago

Do you have a dual-CPU machine, or a big.LITTLE architecture machine? I've found that hyperthreading does help somewhat when compiling small files.

hmaarrfk commented 3 years ago

I can update the instructions to divide by two when I get back to my computer.

rgommers commented 3 years ago

So okay, this does take a painfully long time. It took almost exactly 2 hours for me using 10 cores. There's no good way I can see to get a detailed breakdown of that, but here is my estimate, based on peeking at the terminal output during meetings and the resource usage output in the build log:

A large part of the build time seems to be spent building Caffe2. @IvanYashchuk was looking at disabling that, hopefully it's possible (but probably nontrivial). The number of CUDA architectures to build for is the main difference between a dev build and a conda package build. For the former it's just the architecture of the GPU installed in the machine plus PTX, for the latter it's 7 or 8 architectures.

The build used 7.3 GB of disk space for build_artifacts, plus whatever the Docker image took. 2.4 GB of those 7.3 GB is for a full clone of the repo (takes a while to clone too). Why not use a shallower clone here?

Details on usage statistics from the build log:

```
Resource usage statistics from bundling pytorch:
   Process count: 65
   CPU time: Sys=0:24:42.5, User=11:26:44.8
   Memory: 11.5G
   Disk usage: 2.4M
   Time elapsed: 1:30:08.9

Resource usage statistics from testing pytorch:
   Process count: 12
   CPU time: Sys=0:00:23.4, User=0:06:35.3
   Memory: 2.7G
   Disk usage: 85.6K
   Time elapsed: 0:08:38.5

Resource usage statistics from testing pytorch-gpu:
   Process count: 1
   CPU time: Sys=0:00:00.0, User=-
   Memory: 3.0M
   Disk usage: 16B
   Time elapsed: 0:00:02.9

Resource usage summary:
   Total time: 2:00:05.6
   CPU usage: sys=0:25:05.9, user=11:33:20.1
   Maximum memory usage observed: 11.5G
   Total disk usage observed (not including envs): 2.5M
```

So it looks like if we use half the cores on a 32-core machine, the total time will be about 1 hr 30 min. So 16 builds take ~24 hrs and 160 GB of space.

It's a bit painful, but still preferable to build everything on a single machine imho - fewer chances for mistakes to leak in.

EDIT: for completeness, the shell script to prep the build to ensure I don't pick up env vars from my default config:

unset USE_DISTRIBUTED
unset USE_MKLDNN
unset USE_FBGEMM
unset USE_NNPACK
unset USE_QNNPACK
unset USE_XNNPACK
unset USE_NCCL
unset USE_CUDA
export MAX_JOBS=10
export CPU_COUNT=10
benjaminrwilson commented 3 years ago

Awesome analysis, @rgommers. Thank you for taking the time to put this together. Do you see any potential path forward for getting these builds under the Azure CI timeout, or other options for automatically building them in the cloud?

hmaarrfk commented 3 years ago

Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in caffe2 between builds?

hmaarrfk commented 3 years ago

As for why we don't use shallow clones: they don't end up being all that shallow in practice, and it seems to be hard to check out the tag.

I raised the issue to boa https://github.com/mamba-org/boa/issues/172
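
For reference, outside of conda-build a shallow clone pinned to a release tag would look roughly like the sketch below; whether the recipe tooling can be told to do this is exactly what the boa issue above is about (and submodules would still need handling):

```
# Sketch only: shallow, single-branch clone of one tag (tag name is an example).
git clone --depth 1 --branch v1.9.0 https://github.com/pytorch/pytorch.git
```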

rgommers commented 3 years ago

Do you see any potential option forward for getting these builds under the Azure CI timeout? Or other options for automatically building them on the cloud?

Probably not on 2 cores in 6 hours, especially for CUDA 11, unless Caffe2 can be disabled. The list of architectures keeps growing; for 11.2 it's:

$TORCH_CUDA_ARCH_LIST;6.0;6.1;7.0;7.5;8.0;8.6

It may be possible to prune that, but then there are deviations from the official package. A Tesla P100 or P4 (see https://developer.nvidia.com/cuda-gpus) is still in use I think, and it would be hard for users to predict which GPUs are supported by which conda packages then.
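
For illustration, pruning would amount to overriding that list before the build, along these lines (the values here are an assumption, not a recommendation):

```
# Hypothetical pruned list: Volta and newer only, plus PTX for forward compatibility.
# This deviates from the official packages, which target the full list above.
export TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6+PTX"
```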

Hooking in a custom builder so CI can be triggered is of course possible (and planned for GPU testing), but it is both work to implement and costly. PyTorch is not unique here; other packages like Qt and TensorFlow have the same problem of taking too long to build. That's more a question for the conda-forge core team; I'm not aware of a plan for this.

Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under torch_cpu in caffee between builds?

No, actually disable. There's a lot that's being built there that's not needed - either relevant for mobile build, or just leftovers. Example: there's torch.nn.AvgPool2d which is what users want, and then there's a Caffe2 AveragePool2D operator which is different. The plan for official PyTorch wheels and conda packages is to get rid of Caffe2 at some point.

hmaarrfk commented 3 years ago

@rgommers I'm not sure what the path forward is for today.

Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.

benjaminrwilson commented 3 years ago

@rgommers, @hmaarrfk, is there any way to split up the per-architecture builds? Could we feasibly have separate jobs for each supported CUDA arch?

hmaarrfk commented 3 years ago

@benjaminrwilson they are separated.

You can locally run python build-locally.py and select the configuration you want to build (a sketch follows below).

I've just been manually running them one at a time. rgommers is trying to find a "more efficient" way to do this for long-term maintainability.
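
As a concrete (hedged) example of running a single configuration, where the argument is a .ci_support/ filename with the .yaml suffix dropped:

```
# Pick one CUDA 11.2 config from .ci_support/ and build just that one.
config=$(basename "$(ls .ci_support/*cuda_compiler_version11.2* | head -n 1)" .yaml)
python build-locally.py "$config"
```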

hmaarrfk commented 3 years ago

I then upload them to my Anaconda channel. Later, conda-forge can download the packages from there and upload them to their own channel.

benjaminrwilson commented 3 years ago

Are the actual GPU-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well: https://github.com/conda-forge/pytorch-cpu-feedstock/blob/ac31db4ad35178accd10d8fcf88fcf562ec82874/recipe/build_pytorch.sh#L92. I mean adding another level of the build matrix as a product of the options in that link.

rgommers commented 3 years ago

Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers over nights.

I'm wrapping up things to go on holiday next week, so it's probably best if I didn't say yes.

Are the actual gpu-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not architecture as well:

Indeed, I don't think there's a good way to do this.

hmaarrfk commented 3 years ago

Ah, I see. TBH, this is beyond the scope of this issue; I really just want to get builds for pytorch 1.9 out there with GPU support.

If you think it is worth us discussing this please open a new issue to improve the build process.

We can then define goals and have a more focused discussion.

hmaarrfk commented 3 years ago

OK, I got my hands on a system that I can reasonably leave running alone for a day or two.

I've started the MKL 2021 builds on it and will report tomorrow whether it is doing well.

hmaarrfk commented 3 years ago

3 builds = 10 hours. 16 builds = 54 hours.

I guess it should be done by the end of the weekend.

hmaarrfk commented 3 years ago

@isuruf MKL 2021 builds are complete. Is that enough for this? I might not have enough spare compute (or free time) to build for MKL 2020.

Are you able to upload to conda-forge from my channel?

h-vetinari commented 3 years ago

How are things standing with the upload of the artefacts? 🙃

benjaminrwilson commented 3 years ago

@hmaarrfk, have you been able to get in touch with @isuruf?

hmaarrfk commented 3 years ago

Generally, people might be busy.

I try to ping once a week, or once every two weeks.

isuruf is very motivated; I'm sure he hasn't forgotten about this.

isuruf commented 3 years ago

@hmaarrfk, can you mark the _1 builds with a label?

hmaarrfk commented 3 years ago

Added. The label is forge

hmaarrfk commented 3 years ago

Disabling enough stuff gets things almost passing.

But as expected, when building for many GPUs at once, it does take longer and longer.

Honestly, I would like to keep building for multiple GPUs.

On my systems, I often pair a GT 1030 with a newer GPU so that the newer GPU can be used to its full extent (as opposed to also driving X11).

https://github.com/conda-forge/pytorch-cpu-feedstock/pull/64

benjaminrwilson commented 3 years ago

Yeah, I completely get that. I guess one thing for us to consider is:

nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

Additionally, we could consider nvcc multithreading for certain CUDA versions (although I don't think this will solve everything): https://github.com/pytorch/builder/blob/e05c57608d7ee57bdbd9075ca604b0288ad86c25/manywheel/build.sh#L263

hmaarrfk commented 3 years ago

OK, I'm trying multithreading.

benjaminrwilson commented 3 years ago

Looks like the option is available with cudatoolkit >=11.2: https://docs.nvidia.com/cuda/archive/11.2.0/cuda-compiler-driver-nvcc/index.html.
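
For reference, the flag itself looks like the sketch below (kernel.cu is a hypothetical file; wiring the flag through the PyTorch/CMake build here is a separate question):

```
# nvcc >= 11.2: -t/--threads compiles the listed GPU architectures in parallel.
nvcc --threads 4 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_80,code=sm_80 \
     -c kernel.cu -o kernel.o
```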

hmaarrfk commented 3 years ago

Maybe then we try not to compress.

hmaarrfk commented 3 years ago

I guess it is time to wait 6 hours.

hmaarrfk commented 3 years ago

For what it's worth, I'm rebuilding for MKL 2020, but who knows if it will finish. Maybe those will be done by next week.

hmaarrfk commented 3 years ago

OK. I don't think I can upload any more to my own channel. I might have to remove some packages just to make space for my day job.

[Screenshot: anaconda.org channel storage usage]

isuruf commented 3 years ago

I've uploaded _1 builds

h-vetinari commented 3 years ago

Huge thanks @hmaarrfk and @isuruf for seeing this through!

hmaarrfk commented 3 years ago

@isuruf are you able to upload the _0 builds? I removed the _1 builds from my channel and added the forge label to all the _0 builds.

hmaarrfk commented 3 years ago

I think the MKL 2021 migration is complete, so we can likely just avoid uploading the _0 builds and save some storage space on anaconda.org.