Mark, could you please explain the process briefly? I was thinking that it's not allowed to upload manually built packages to conda-forge. People would build the listed CUDA builds and upload them to some storage, then the feedstock maintainers would upload them manually to the conda-forge channel, right?
Upload them to your own public anaconda channel.
I kinda want to merge the mkl migration first.
This approach seems a little fragile, and more work than needed, especially in the long term. There are 16 builds here, and one build takes 15-25 minutes on a decent build machine, so all binaries can be built in 4-6 hours. How about writing a reproducible and well-documented build script and letting a single person with access to a good build server build everything at once?
```bash
#!/usr/bin/env bash
set -ex

conda activate base
# Ensure that the anaconda command exists for uploading
which anaconda

docker system prune --force

configs=$(find .ci_support/ -type f -name '*cuda_compiler_version[^nN]*' -printf "%p ")

# Upload anything already present in build_artifacts (e.g. from a previous run)
anaconda upload --skip build_artifacts/linux-64/pytorch*

# Assuming a powerful enough machine with many cores;
# 10 seems to be a good point where things don't run out of RAM too much.
export CPU_COUNT=10

for config_filename in $configs; do
    filename=$(basename "${config_filename}")
    config=${filename%.*}
    if [ -f "build_artifacts/conda-forge-build-done-${config}" ]; then
        echo "skipped ${config}"
        continue
    fi
    python build-locally.py "${config}"
    # Docker images get quite big; clean them up after each build to save disk space
    docker system prune --force
    anaconda upload --skip build_artifacts/linux-64/pytorch*
done
```
15-25 mins.... what kind of machine do you have access to?
^^^^ it's kinda a serious question. I'm genuinely interested in knowing.
> 15-25 mins.... what kind of machine do you have access to?
Desktop 12-core / 32 GB. And we have a few 32-core / 128 GB dev servers.
If that script is all there is to it, that's quite nice. Also let's make sure it doesn't blow up disk space completely. Are they all separate Docker images, and how much space does one take?
The docker images do take space.
My build machine's root storage is full, and Docker is complaining about disk space for me.
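For anyone hitting the same wall, Docker itself can report what is eating the space (standard Docker CLI, nothing feedstock-specific):
```bash
# Show disk usage broken down by images, containers, volumes, and build cache
docker system df
# Then reclaim it: removes stopped containers, dangling images, and unused networks
docker system prune --force
```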
@rgommers I'm not really sure what happened, but my build time was closer to 8 hours on an AMD Ryzen 7 3700X 8-core processor. Maybe my processor was oversubscribed, but I started a build just now and checked that nobody else was using the server for at least an hour. I can report if the build finishes in under an hour, but I somewhat doubt it.
That seems really long. There is a lot of stuff to turn on and off, so I probably cheated here by turning a few of the expensive things off - in particular using `USE_DISTRIBUTED=0`. That said, here is an impression of the PyTorch CI build stages on CircleCI:
The >1 hr one is the mobile build. Regular Linux builds are in the 20-50 min range. 8 hours doesn't sound right, something must be misconfigured for you.
Thanks for the info. I think I'm not using all CPUs; I can see that only 2 are being used. I probably need to pass another environment variable through. I'll have to see how I can do that.
`MAX_JOBS` is the env var that controls how many cores the PyTorch build will use.
I have started one build with `build-locally.py` to time how long it actually takes for me.
It seems `MAX_JOBS` is set from `CPU_COUNT`, a conda-forge variable that sets the number of processors for CI builds.
I'm running
```bash
CPU_COUNT=$(nproc) time python build-locally.py
```
to see how much it helps. I think it should help a lot. Thanks for helping debug.
I'm updating the suggested script too.
> `$(nproc)`
That's still not quite right for me. I get:
```bash
$ nproc
2
$ nproc --all
24
```
The optimal number is the number of physical cores, I think; `24` will be slower than `12`.
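In script form, that heuristic could look like this (a sketch, assuming two hyperthreads per physical core):
```bash
# Use half the logical CPUs, i.e. roughly the physical core count
export CPU_COUNT=$(( $(nproc --all) / 2 ))
export MAX_JOBS=${CPU_COUNT}
```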
It takes about 15 minutes before the build actually starts - downloading + solving the build env + cloning the repo is very slow.
And probably at the end it'll take another 10 minutes; IIRC another conda solve is needed to set up the test env. And no tests are run other than `import torch`, so leaving this out of the recipe could help.
Do you have a dual-CPU machine, or a big.LITTLE architecture machine? I've found that hyperthreading does somewhat help when compiling small files.
I can update the instructions when I get back to my computer to divide by two.
So okay, this does take a painfully long time. It took almost exactly 2 hours for me using 10 cores. There's no good way I can see to get a detailed breakdown of that, but here is my estimate based on peeking at the terminal output during meetings and the resource usage output in the build log:
A large part of the build time seems to be spent building Caffe2. @IvanYashchuk was looking at disabling that, hopefully it's possible (but probably nontrivial). The number of CUDA architectures to build for is the main difference between a dev build and a conda package build. For the former it's just the architecture of the GPU installed in the machine plus PTX, for the latter it's 7 or 8 architectures.
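To make the difference concrete, the two cases look roughly like this (the exact lists here are illustrative, not taken from the recipe):
```bash
# Dev build: just the local GPU's architecture plus PTX (sm_75 as an example)
export TORCH_CUDA_ARCH_LIST="7.5+PTX"
# Package build: every architecture the binary should support
export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6+PTX"
```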
The build used 7.3 GB of disk space for `build_artifacts`, plus whatever the Docker image took. 2.4 GB of those 7.3 GB is for a full clone of the repo (which takes a while to clone too). Why not use a shallower clone here?
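A shallower clone would look something like this (a sketch; the tag is an example, and PyTorch's many submodules may complicate it in practice):
```bash
# Depth-1 clone of a single release tag, with shallow submodules
git clone --depth 1 --branch v1.9.0 --recurse-submodules --shallow-submodules \
    https://github.com/pytorch/pytorch.git
```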
Details on usage statistics from the build log:
So it looks like if we use half the cores on a 32-core machine, the total time will be about 1 hr 30 min. So 16 builds take ~24 hrs and 160 GB of space.
It's a bit painful, but still preferable to build everything on a single machine imho - fewer chances for mistakes to leak in.
EDIT: for completeness, the shell script to prep the build to ensure I don't pick up env vars from my default config:
unset USE_DISTRIBUTED
unset USE_MKLDNN
unset USE_FBGEMM
unset USE_NNPACK
unset USE_QNNPACK
unset USE_XNNPACK
unset USE_NCCL
unset USE_CUDA
export MAX_JOBS=10
export CPU_COUNT=10
Awesome analysis, @rgommers. Thank you for taking the time to put this together. Do you see any potential option forward for getting these builds under the Azure CI timeout? Or other options for automatically building them on the cloud?
Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under `torch_cpu` in caffe2 between builds?
As for why we don't use shallow clones: in practice they don't end up being all that shallow, and it seems to be hard to check out the tag.
I raised the issue with boa: https://github.com/mamba-org/boa/issues/172
> Do you see any potential option forward for getting these builds under the Azure CI timeout? Or other options for automatically building them on the cloud?
Probably not on 2 cores in 6 hours, especially for CUDA 11, unless Caffe2 can be disabled. The list of architectures keeps growing; for 11.2 it's:
```
$TORCH_CUDA_ARCH_LIST;6.0;6.1;7.0;7.5;8.0;8.6
```
It may be possible to prune that, but then there are deviations from the official package. A Tesla P100 or P4 (see https://developer.nvidia.com/cuda-gpus) is still in use I think, and it would then be hard for users to predict which GPUs are supported by which conda packages.
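Purely as an illustration of what pruning would mean (not a recommendation), dropping the Pascal-era entries would look like:
```bash
# Volta and newer only; Tesla P100/P4 users (6.0/6.1) would be left without support
export TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6+PTX"
```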
Hooking in a custom builder so CI can be triggered is of course possible (and planned for GPU testing), but it's both work to implement and costly. PyTorch is not unique here; other packages like Qt and TensorFlow have the same problem of taking too long to build. That's more a question for the conda-forge core team; I'm not aware of a plan for this.
> Why is it important to disable Caffe2 builds? Do you mean trying to share the stuff under `torch_cpu` in caffe2 between builds?
No, actually disable it. There's a lot being built there that's not needed - either relevant only for mobile builds, or just leftovers. Example: there's `torch.nn.AvgPool2d`, which is what users want, and then there's a Caffe2 `AveragePool2D` operator which is different. The plan for official PyTorch wheels and conda packages is to get rid of Caffe2 at some point.
@rgommers I'm not sure what the path forward is for today.
Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.
@rgommers, @hmaarrfk, is there any way to split up the per-architecture builds? Could we feasibly have separate jobs for each supported CUDA arch?
@benjaminrwilson they are separated.
You can locally run `python build-locally.py` and select the configuration you want to build.
I've just been manually running them one at a time. @rgommers is trying to find a "more efficient" way to do this for long-term maintainability.
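Concretely, that looks like the following (the config name is illustrative, following the `.ci_support` naming pattern):
```bash
# See which CUDA configs exist, then build exactly one of them
ls .ci_support/ | grep cuda
python build-locally.py linux_64_cuda_compiler_version11.2
```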
I then upload them to my anaconda channel. Later, conda-forge can download the packages from there and upload them to their own channel.
Are the actual GPU-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well: https://github.com/conda-forge/pytorch-cpu-feedstock/blob/ac31db4ad35178accd10d8fcf88fcf562ec82874/recipe/build_pytorch.sh#L92. I mean adding another level to the build matrix as a product of the options in that link.
> Are you able to build everything over 24/48 hours? Otherwise, I can keep chugging along building on my servers overnight.
I'm wrapping up things to go on holiday next week, so it's probably best if I didn't say yes.
> Are the actual GPU-specific builds being separated too? Maybe I'm missing something, but it looks like the runs are split by CUDA version, but not by architecture as well:
Indeed, I don't think there's a good way to do this.
Ah, I see. TBH, this is beyond the scope of this issue tracker; I really just want to get builds for PyTorch 1.9 out there with GPU support.
If you think it is worth discussing, please open a new issue to improve the build process.
We can then define goals and have a more focused discussion.
Ok. I got my hands on a system that I might reasonably be able to leave running alone for a day or two.
I've started the MKL 2021 builds on it and I'll report tomorrow if it is doing well.
3 builds took 10 hours, so 16 builds should take about 54 hours.
I guess it should be done by the end of the weekend.
@isuruf the MKL 2021 builds are complete. Is that enough for this? I might not have enough spare compute (or free time) to build for MKL 2020.
Are you able to upload to conda-forge from my channel?
How are things standing with the upload of the artefacts? 🙃
@hmaarrfk, have you been able to get in touch with @isuruf?
Generally, people might be busy. I try to ping once a week, or once every two weeks.
isuruf is very motivated; I'm sure he hasn't forgotten about this.
@hmaarrfk, can you mark the `_1` builds with a label?
Added. The label is `forge`.
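For reference, labels can also be applied at upload time via anaconda-client's `--label` flag (the path here is illustrative):
```bash
# Upload under the "forge" label instead of the default "main"
anaconda upload --label forge build_artifacts/linux-64/pytorch-*.tar.bz2
```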
Disabling enough stuff gets things almost passing.
But as expected, when building for many GPUs at once, it does take longer and longer.
Honestly, I would like to keep building for multiple GPUs.
On my systems, I often pair up a GT 1030 with a newer GPU so I can utilize the newer GPU to its full extent (as opposed to also using it for X11).
https://github.com/conda-forge/pytorch-cpu-feedstock/pull/64
Yeah, I completely get that. I guess one thing for us to consider is:
```
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
```
Additionally, we could consider `nvcc` multithreading for certain CUDA versions (although I don't think this will solve everything): https://github.com/pytorch/builder/blob/e05c57608d7ee57bdbd9075ca604b0288ad86c25/manywheel/build.sh#L263
Ok, I'm trying multithreading.
Looks like the option is available with cudatoolkit `>=11.2`: https://docs.nvidia.com/cuda/archive/11.2.0/cuda-compiler-driver-nvcc/index.html.
Maybe then we try not to compress.
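A sketch of what that could look like, modeled on the pytorch/builder script linked above (the variable name and thread count are assumptions, not the feedstock's actual settings):
```bash
# nvcc >= 11.2 can compile device code for several architectures in parallel
# via --threads; dropping "-Xfatbin -compress-all" would be the
# "don't compress" experiment
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
```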
I guess it is time to wait 6 hours.
For what it's worth, I'm rebuilding for MKL 2020, but who knows if it will finish. Maybe they will be done by next week.
Ok. I don't think I can upload any more to my own channel. I might have to remove some packages just to make space for my day job.
I've uploaded the `_1` builds.
Huge thanks @hmaarrfk and @isuruf for seeing this through!
@isuruf are you able to upload the `_0` builds? I removed the `_1` builds from my channel and added `forge` to all the `_0` builds.
I think that the MKL 2021 migration is complete, and we can likely just avoid uploading the `_0` builds to save some storage space on anaconda.
We will be starting a CUDA build run after https://github.com/conda-forge/pytorch-cpu-feedstock/pull/44 is merged. This table should help track the builds.
CUDA Build Tracker

MKL 2021: All 16 builds have been uploaded to conda-forge.

[Tracking table: the `==1.9.0-*_0` and `==1.9.0-*_1` builds, the `forge` label, and the channels each build is available on.]