hmaarrfk commented 2 years ago

Comment:

This package currently requires more than 16 builds to be build manually to ensure that it completes in time on the CIs.

Step 1: No more git clone

rgommers identified that one portion of the build process that takes time is cloning the repository. In my experience, cloning the 1.5GB repo can take up to 10 min on my powerful local machine, but I feel like it can take much longer on the CIs.

To avoid cloning, we will have to list out all the submodule manually, or make the conda-forge installable dependencies.

I mostly got this working using a recursive script which should help us keep it maintained: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/109

Option 1: Split off Dependencies:

Dependency	linux	mac	win	GPU Aware	PR	system deps
pybind11				no	https://github.com/conda-forge/pybind11-feedstock	USE_SYSTEM_PYBIND11
cub				no	https://github.com/conda-forge/cub-feedstock
eigen				no	https://github.com/conda-forge/eigen-feedstock	USE_SYSTEM_EIGEN_INSTALL
googletest				no	will not package
benchmark				no	https://github.com/conda-forge/benchmark-feedstock
protobuf				no	https://github.com/conda-forge/libprotobuf-feedstock
ios-cmake					not needed since we don't target ios
NNPACK	yes	yes		no	https://github.com/conda-forge/staged-recipes/pull/19103
gloo	yes	yes		yes	https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_GLOO
pthreadpool	yes	yes		no	https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_PTHREADPOOL
FXdiv	yes	yes		header	https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_FXDIV
FP16	yes	yes		header	https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_FP16
psimd	yes	yes		header	https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_PSIMD
zstd	yes	yes	yes	no	https://github.com/conda-forge/zstd-feedstock
cpuinfo	yes	yes	no	no	https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_CPUINFO
python-enum				no	https://github.com/conda-forge/enum34-feedstock
python-peachpy	yes	yes	yes	no	https://github.com/conda-forge/staged-recipes/pull/19103
python-six	yes	yes	yes	no	https://github.com/conda-forge/six-feedstock
onnx				no	https://github.com/conda-forge/onnx-feedstock	USE_SYSTEM_ONNX
onnx-tensorrt				only
sleef				no	https://github.com/conda-forge/sleef-feedstock	USE_SYSTEM_SLEEF
ideep
oneapisrc
nccl					https://github.com/conda-forge/nccl-feedstock
gemmlowp
QNNPACK	yes	yes			https://github.com/conda-forge/staged-recipes/pull/19103
neon2sse
fbgemm				yes
foxi
tbb					https://github.com/conda-forge/tbb-feedstock	USE_SYSTEM_TBB (deprecated)
fbjni
XNNPACK	yes	yes			https://github.com/conda-forge/staged-recipes/pull/19103	USE_SYSTEM_XNNPACK
fmt					https://github.com/conda-forge/fmt-feedstock
tensorpipe				yes
cudnn_frontend
kineto
pocketfft
breakpad
flatbuffers	yes	yes	yes	no	https://github.com/conda-forge/flatbuffers-feedstock
clog	static	static			https://github.com/conda-forge/staged-recipes/pull/19103

clog seems to be a pretty low level library that is assisted by compile time flags. I think it is best if we don't package that one as a library. It seems like it will require some serious consideration in terms of performance if we do. They typically the full source in the repository. The only problematic thing, is that each package attempts to install the static library into the library path.
QNNPACK has a build option to allow a special provision for CAFFE2's implementation of pthreadpool
- It seems to be problematic with pthreadpool on OSX.
QNNPACK likely has two different implementations, the one they vendored in ATen, and the one they vendored in third_party.
NNPACK has two different backens, one generated by python it seems, but for some reason fp16.py cannot be found, the other with psimd.

Option 2 - step 1: Build a libpytorch package or something

By setting BUILD_PYTHON=OFF in https://github.com/conda-forge/pytorch-cpu-feedstock/pull/112/ we then end up with the following libraries in lib and include:

Dependency	linux	mac	GPU Aware	PR
libasmjit	yes	yes		https://github.com/conda-forge/staged-recipes/pull/19103
libc10	yes	yes		https://github.com/conda-forge/staged-recipes/pull/19103
libfbgemm	yes	yes	yes	https://github.com/conda-forge/staged-recipes/pull/19103
libgloo	yes	yes	yes
libkineto	yes		yes	https://github.com/conda-forge/staged-recipes/pull/19103
libnnpack	yes		???	https://github.com/conda-forge/staged-recipes/pull/19103
libpytorch_qnnpack	yes	yes		https://github.com/conda-forge/staged-recipes/pull/19103
libqnnpack	yes	yes		https://github.com/conda-forge/staged-recipes/pull/19103
libtensorpipe			yes
libtorch
libtorch_cpu
libtorch_global_deps
Header only
ATen
c10d
caffe2
libnop	yes	yes		https://github.com/conda-forge/staged-recipes/pull/19103

Option 2 - step 2: Depend on new ATen/libpytorch package

Compilation time progress

platform	python	cuda	main	tar gh-109	system deps
linux 64	3.7	no	1h57m	1h54m
linux 64	3.8	no	2h0m	1h51m
linux 64	3.9	no	2h31m	2h2m
linux 64	3.10	no	2h26m	2h7m
linux 64	3.7	11.2	6h+ (`3933/4242` 309 remaining)	6h+
linux 64	3.8	11.2	6h+ (`3897/4242` 345 remaning)	6h+
linux 64	3.9	11.2	6h+ (`3924/4242` 318 remaining)	6h+	6h+`1656/1969` 313 remaining
linux 64	3.10	11.2	6h+ (`3962/4242` 280 remaining)	6h+
osx-64	3.7		2h42m	2h39m
osx-64	3.8		3h28m	2h52m
osx-64	3.9		2h40m	2h42m
osx-64	3.10		3h2m	2h42m
osx-arm-64	3.8		1h51	1h37m
osx-arm-64	3.9		2h20m	2h10m
osx-arm-64	3.10		4h25m	2h1m

There are approximately:

3600 files to compile for cmake for the CPU builds with the standard build process
1600-1800 files to compile when using system dependencies: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/111

rgommers commented 2 years ago

To avoid cloning, we will have to list out all the submodule manually, or make the conda-forge installable dependencies.

Cloning with --depth 1 seems preferable to separately building as dependencies. Separate dependencies/feedstocks/packages is a lot of overhead and noise for something that isn't usable by anything other than this feedstock.

The script in gh-109 looks interesting. Should work like that I guess; I just forgot why using --depth doesn't work? Seems like a lacking feature in git itself if it doesn't allow a shallow clone.

hmaarrfk commented 2 years ago

I think the problem is that conda first clones the main branch with depth 1, then cannot switch to an older tag like version v1.11.0 because it didn't clone it.

It also didn't play well with caching.

It is somewhat of a job to unbundle but i guess I find it worthwhile if it means we can release this more easily. I'm hoping i can patch things in a way that is acceptable upstream.

hmaarrfk commented 2 years ago

I remember what conda tried to do:

it tried to clone a bare repo locally.
then uses that as a cache to clone the sources for the build.

That's not super valuable for CI workflows, but makes it hard to do a shallow clone. At the time I couldn't think of a solution to propose upstream to conda build.

rgommers commented 2 years ago

It is somewhat of a job to unbundle but i guess I find it worthwhile if it means we can release this more easily. I'm hoping i can patch things in a way that is acceptable upstream.

Makes sense. I have no problem with unbundling provided it doesn't change the sources that are built. The kind of unbundling Linux distros do, like "hey this project is pinned to version X of , but we insist on using our own version Y" is much more problematic, because you then build a combo of sources that is not tested at all in upstream CI, and may be plain buggy.

hmaarrfk commented 2 years ago

We kinda do that with along of c dependencies don't we (not as much in pytorc)

My hope is that i can split off onnx and ATen in versions that match pytorch.

hmaarrfk commented 2 years ago

You can follow somewhat of a first pass at step 2 here https://github.com/conda-forge/staged-recipes/pull/19103#issuecomment-1140429910

There are quite a few header only, and other libraries, that get downloaded on the fly using custom cmake code.

That isn't really fun. So even what I did in gh-109 isn't really complete, in terms of not downloading during bulid.

h-vetinari commented 2 years ago

That isn't really fun. So even what I did in gh-109 isn't really complete, in terms of not downloading during bulid.

That list of submodules is insane. Reminds me of what I quipped in #76 when I first came across that:

Pytorch has [...] under third_party/ (side note: holy maccaroni, that folder is a packager's nightmare 😅).

Seems even that was underestimating the extent of the issue. Unsurprisingly, I really dislike this "we vendor specific commits of open source projects" development model - it's a very "my castle" approach.

On the other hand, I see where it is coming from, with C/C++'s complete lack of standardised tooling around distribution.

h-vetinari commented 2 years ago

But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on.

All in all, I fully support ripping this apart one by one (hopefully even in ways that would be palatable upstream), but I get Ralf's point about not diverging from what's actually being tested - though I'd be fine to caveat that based on an actually conceivable risk of breakage (e.g. if there are no functional changes between the vendored commit and a released version in a given submodule)

hmaarrfk commented 2 years ago

On the other hand, I see where it is coming from, with C/C++'s complete lack of standardised tooling around distribution.

Right. This is likely what the original creators were grappling with. They decided to either use git submodules in certain projects, or cmake code to download things they needed. Bazel does the same.

But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on.

The issue occurs on who is in charge of the support. pytorch (and facebook) cannot force six or tbb to push a fix if their users (other developers at facebook) find a problem. Eventually, one user will have an issue. Because they have the developer resources, they decide to take on the responsibility of maintaining it for their package.

When pip was the only option, you were beholden to the creator of the original package on pypi pleading them to support a feature you need (i've been there many time, and in a sense, we are there with our packaging asks for pytorch).

h-vetinari commented 2 years ago

The issue occurs on who is in charge of the support. pytorch (and facebook) cannot force six or tbb to push a fix if their users (other developers at facebook) find a problem. Eventually, one user will have an issue. Because they have the developer resources, they decide to take on the responsibility of maintaining it for their package.

Sure, but what's missing IMO is closing the loop to a released version with the bugfix afterwards.

hmaarrfk commented 2 years ago

Sure, but what's missing IMO is closing the loop to a released version with the bugfix afterwards.

Its pretty hard to make a business case as to why you should spend a few hours, and likely more time, submitting a fix upstream after you have fixed things for your users.

Anyway, i'm just going through listing things that need to be done. There are a few big packages that we might be able to take advantage from.

rgommers commented 2 years ago

I think you're both missing a very important point: dependencies are fragile. Once you have O(25) dependencies, and would actually express them as dependencies, you become susceptible to a ton of extra bugs (even aside from a ton more packaging/distribution issues). It simply isn't workable.

I had to explain the same thing when SciPy added Boost as a submodule. Boost 1.75 was known to work, 1.76 is known to be broken, yet other versions are unknown. Having a single tested version limits the surface area of things you're exposed to, and also makes it easier to reproduce and debug issues. PyTorch has zero interesting dependencies at runtime (not even numpy anymore), and only one config of build-time library dependencies that are vendored in third-party/.

There's a few libraries, e.g. pybind11, that are well-maintained and low-risk to unbundle. But most of them aren't like that.

There's of a course a trade-off here- in build time, and "let's find bugs early so we can get them fixed for the greater good", but on average PyTorch is doing the right thing here if they want users to have a good experience.

Why vendor six?

six is designed to be vendored. As are other such simple utilities, like versioneer. It's not strange at all - dependencies are expensive.

hmaarrfk commented 2 years ago

rgommer, i really agree with:

dependencies are fragile.

which is why i brought up the case of support. You want to be able to control it if you are in charge of shipping a product.

I actually think we should likely skip the unbundling, and build an intermediary output instead. I'm mostly using this effort to try to understand them, and understand their build system.

I changed the top post to reflect this, listing "unbundling" and the intermediary library as two distinct options (potentially complimentary).

hmaarrfk commented 2 years ago

So I think I've gone as far as I want to. I actually got to the point where I was exactly 1 year ago, where I was trying to build ideep.

https://github.com/conda-forge/staged-recipes/pull/7491

Ultimately, my concern isn't the fact that I can build it, I think I can. But rather my concern is wether or not I can build it with similar enough options that pytorch tests with. That i'm not super excited about.

h-vetinari commented 2 years ago

I think you're both missing a very important point: dependencies are fragile. Once you have O(25) dependencies, and would actually express them as dependencies, you become susceptible to a ton of extra bugs (even aside from a ton more packaging/distribution issues). It simply isn't workable.

I agree with you on a lot of this, but let's please avoid assuming who's missing this point or that. I didn't say that everything should to be a direct dependency, or that that there can't be good reasons for moving to unreleased commits as a stopgap measure (with a work item to move back to a released version as it becomes available), or that it's inherently bad practice (the lack of good tooling forces projects into making really bad trade-offs, but disliking that state of affairs is not an accusation towards anyone).

But with ~60 submodules, not doing that makes integration work pretty much impenetrable, as we've seen for pytorch & tensorflow. I get that this discipline (or extra infrastructure for not using the vendored sources) has low perceived value for companies like Google and Meta, and this is a large part how the situation got to this point (in addition to the lack of good tooling e.g. like cargo).

I don't claim to have the answer (mostly questions) - if someone had a cookie-cutter solution, we'd have seen it by now. I still think that untangling this web of dependencies (possibly also into intermediate pieces) would be very worthwhile both for conda-forge itself and for upstream. Sadly, tensorflow hasn't even shown slight interest in fixing their circular build dependencies, so it's an uphill battle, and we have quite a ways to go on that...

hmaarrfk commented 2 years ago

@h-vetinari if you want to help on this effort, I think packaging onnx-tensorrt would be very helpful and is quite independent from the effort here.

I don't think it is as easy to plug it in, but I think it does add to the compilation time since I think it is GPUaware. So is fbgemm.

hmaarrfk commented 2 years ago

actually, just building libonnx would likely be a welcome first step!

hmaarrfk commented 2 years ago

But I don't get so many things in that list, especially mature projects. Why vendor six? tbb? fmt? pybind11? The list goes on.

In all fairness, they do provide overrides to "mature projects". We just never felt it was a good idea to use them since they don't really move the needle in terms of compilation time.

Ultimately, it is the projects that are "less mature" that they are using exact commits to.

Again, in fairness to them, these are fast moving projects, that seems to have been built quickly, for the specific use case of enabling caffe/caffe2/torch/pytorch.

The other category seems to be GPU packages that need to be built harmoniously with pytorch. Honestly, this feels a little bit like a "conda-forge" problem in the sense that if we had more than 6 hours of compilation time, and likely more than 2 cores to compile on, we could build in the prescribed amount of time.

Pytorch is:

[x] Documenting their versions
[x] Not depending on any closed source build system

Which is honestly more than we can hope for.

h-vetinari commented 2 years ago

Pytorch is:

Documenting their versions

Not depending on any closed source build system

Which is honestly more than we can hope for.

Yes, that's a great start. I disagree that we can't have higher aspirations though. 🙃

Honestly, this feels a little bit like a "conda-forge" problem in the sense that if we had more than 6 hours of compilation time, and likely more than 2 cores to compile on, we could build in the prescribed amount of time.

Indisputably, though 6h is already a whole bunch more than we had in pre-azure days. "capable of building on public CI" (in some sequence of individual chunks) is not an unreasonable wish I think.

@h-vetinari if you want to help on this effort, I think packaging onnx-tensorrt would be very helpful and is quite independent from the effort here.

Yes, interested, but low on time at the moment...

rgommers commented 2 years ago

Indisputably, though 6h is already a whole bunch more than we had in pre-azure days. "capable of building on public CI" (in some sequence of individual chunks) is not an unreasonable wish I think.

Agreed, that would be a good thing to have, and a reasonable ask to upstream (which I'll make next time I meet with build & packaging folks). Looking at the updated table, there's only a couple of builds that don't fit and they're not ridiculously far from the limit: ~ 6h+ (3933/4242 309 remaining). That said, breaking it in half so it comfortably fits would be better.

Another thing that is likely coming in a future release is the ability to reuse the non-CPython-specific parts between builds. Because 95% of the build doesn't depend on Python version, so having to rebuild everything 4 times for each supported Python version is a massive waste.

hmaarrfk commented 2 years ago

@rgommers FWIW, you essentially fly through most of the builds until you get to the large GPU kernels which need to be compiled for every data type, every GPU architecture, and then all put together. So the "3000 files to compile" vs "1800" is really misleading since only 500 files take the compilation time.

As for building as a library: by adjusting the tests, I should be a good place to get the CPU build of https://github.com/conda-forge/pytorch-cpu-feedstock/pull/112 working. It doesn't seem to move the needle very much. Again, due to the fact that the intensive stuff still takes as much time as it did before. (The CPU build still takes about 2 hours even without the python stuff).

hmaarrfk commented 2 years ago

ok, i spoke too soon. While you can disable BUILD_PYTHON by setting it to OFF or 0.

It seems to be hard to USE the prebuild library that you install in an earlier run.

There seems to be 3 natural checkpoints that they create for their own reasons that might be helpful to us. These checkpoint already get installed, but in their standard build process get "copied" into the python module (as required by pip installed packages)

libc10
libtorch_cpu (this seems like it contains some GPU symbols too -- This seems to take 1.5hrs for CPU only builds. so on the 6 hour constrained GPU builds, splitthing this off as an extra package would be helpful.
libtorch_gpu

They all seem to get assembled by libtorch

hmaarrfk commented 2 years ago

I'm not really sure conda is setup to detect the precise hardware, but rather the version of the cuda library.

It is quite hard to choose a hardware cutoff value. I don't really want to be choosing it at this level.

I personally have some setups with new and old GPUs. Crazy right! Though I may be an exception. I would be happy if things worked on my fancy new one.

hmaarrfk commented 2 years ago

even more radically, we can even try to split our packages into compute targets, not cuda targets.

Please open an other issue regarding dropping architectures.

Maybe: https://github.com/conda-forge/conda-forge.github.io/issues?q=is%3Aissue+is%3Aopen+gpu

isuruf commented 2 years ago

This discussion came up multiple times before. Please lookup the discussions in this feedstock. We agreed that it's better to follow upstream build scripts at https://github.com/pytorch/builder/blob/7fbb9d887a39be2c9ed55dea2a4b22201425e0bc/conda/pytorch-nightly/build.sh to avoid surprises.

isuruf commented 2 years ago

@ngam, it wasn't meant to be condescending. Sorry about that. I just wanted to make you aware that this discussion has happened before and I didn't want the discussion to happen again and again. If you have new information about this, we are certainly open to discussion.

h-vetinari commented 2 years ago

Well, this discussion became really hard to follow with the deleted comments (@ngam, please don't do that...).

From what I can tell, GPU builds were passing when building for a single GPU arch. Personally I wouldn't mind an explosion of the build matrix if it means we can build everything in CI.

But it's not clear to me that conda has the capability to detect the GPU architecture yet, and even then, we'd still have to solve the question of multi-GPU setups with different arches.

ngam commented 2 years ago

(@ngam, please don't do that...).

Sorry about messing this the flow of the convo. Your general understanding is correct (though, not only a "single" arch, it could be multiple, e.g. cuda102 with less arches passes on the CI sometimes). I made a mistake by inserting the topic into this specific conversation, so I deleted my comments. (Isuruf, no worries, it is my fault.) I was too focused on getting us under the 6-hour limit; notwithstanding the concerns (conda implementation, multiple GPUs, upstream compatibility, cuda issues, etc. --- which could all be addressed), I think it would be a little too crazy to implement such a change without some ecosystem-wide conversation first. We already have inconsistent implementation of __cuda (some packages like tensorflow have it; others like pytorch do not; will the maintainers of cudatoolkit, cudnn, and nccl ever adopt it?). Imagine if we move pytorch from cuda112-like variants to sm80-like variants... Plus, we cannot yet get tensorflow to pass on the CI at all, so less reason to contemplate this... Anyway, apologies, all. Please know I do want to be helpful, but sometimes I myself get in the way of that 😞

rgommers commented 2 years ago

I actually think we should likely skip the unbundling, and build an intermediary output instead. I'm mostly using this effort to try to understand them, and understand their build system.

I received some context on the USE_SYSTEM_xxx env vars. These were contributed by someone from OpenSuse to allow building against system libraries. PyTorch maintainers agreed to merge the changes, but they're completely untested in PyTorch CI. So recommended to be careful; if it doesn't save significant time better not to use it probably.

isuruf commented 2 years ago

@ngam, one optimization we can try out is using --threads option in nvcc 11.2 to do parallel compilation. Do you know if it's already done or not?

hmaarrfk commented 2 years ago

These were contributed by someone from OpenSuse to allow building against system libraries.

This is a good point

ngam commented 2 years ago

@ngam, one optimization we can try out is using --threads option in nvcc 11.2 to do parallel compilation. Do you know if it's already done or not?

I haven't thought of this before --- I have no idea to be honest. Let me try to look into it. In a previous (deleted comment), I said I noticed the GPU compilation becomes more serial-like (or more accurately less parallel) once we hit the GPU kernels (towards the end of the compilation). I am not very certain, but that's been my observation with tensorflow (in general alignment with hmaarrfk's statement earlier about hitting the GPU kernels)

Another option (and I should check the issues and PRs before I speak too much about this): We can try to use clang for the GPU kernels instead of nvcc if we can gain any speed...

hmaarrfk commented 2 years ago

So I guess my ultimate goal is to:

Get to an intermediate package that I can share between builds of python. (call this libtorch)
Get that intermediate package build on conda-forge infrastructure.
Use the intermediate package when building pytorch bindings
Relax.

Even if I can't do 2. (buiding CUDA on CIs), it would still be immensely valuable to be able to do 1. O(N_CUDA_BUILDS) times, instead of O(N_CUDA_BUILDS * N_PYTHON_BUILDS) times. Currently 16 * 3 hour builds takes about 48 hours. It may be closer to 36 hours, but it is still a significant amount of time.

However, (ignoring all comments about the choice of c-dependency management), it seems like I still have to do alot of work to cleanup the build process for pytorch. For what its worth, it seems like it start and ends in a about 200 lines of a CMakeLists.txt file https://github.com/pytorch/pytorch/blob/master/caffe2/CMakeLists.txt#L1917

I totally understand that in a pypi+pip first world, the choices made above seem correct. However, we've had quite a few people trying to link to libtorch.so and showing that they can build ontop of it.

@rgommers I'm trying to limit the number of asks we have of upstream. Do you think it would be reasonable to ask them to:

Split off the pytorch build more cleanly from the libtorch build?
Add an option to not copy the library files, and associated headers, into the site-packages directory?

My Ideal scenerio is:

libtorch

cmake -DBUILD_PYTHON=OFF [...]
make install

pytorch

# install libtorch
# build with: 
pip install . --no-deps
# or 
cmake -DBUILD_PYTHON=ON -DBUILD_LIBS=OFF [...]

ngam commented 2 years ago

My Ideal scenerio is:

libtorch

Does this help? https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/libtorch.rb

hmaarrfk commented 2 years ago

Does this help? https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/libtorch.rb

Thank you. Actually, this kinda gets to the same point I got with libtorch in https://github.com/conda-forge/staged-recipes/pull/19103 where I build the libraries. It really helps validate my approach.

I probably should cleanup my approach alot. I don't like the many subpackages I created, but the general build steps are there.

Howerver, it doesn't do step 2, which is to compile the python package linking it against the shared lib.

ngam commented 2 years ago

Howerver, it doesn't do step 2, which is to compile the python package linking it against the shared lib.

Yes and if I remember correctly back when I used brew, I personally didn't manage do step 2 based on their approach. They also have a "torchvision" package, which is really libtorchvision fwiw: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/torchvision.rb

ngam commented 2 years ago

Do you think it would be reasonable to ask them to:

Split off the pytorch build more cleanly from the libtorch build?

This, in theory, should be quite beneficial to upstream. Btw, mxnet does exactly this and it works quite well in my experience. Mentioning mxnet as an option in case you want to see their build setup. (I am not sure about number 2 in this list; I don't have a full understanding)

hadim commented 1 year ago

If the main motivation to split pytorch in smaller packages is because of CI time constraints then what about GH Large Runners?

Just throwing an idea here in case it can decrease the maintenance burden. It seems more and more important as more and more cf packges are built against pytorch.

hmaarrfk commented 1 year ago

If the main motivation to split pytorch in smaller packages is because of CI time constraints

This is an important motivation. And likely the most critical one.

As a second bonus, i would rather not have 4x the number of uploads for each python version.

then what about GH Large Runners?

I'm not sure how to use them at Conda-forge. Do you know how to enable it? PR welcome!

hadim commented 1 year ago

We use github_actions as the main CI in our private conda-forge-like organization but it seems like it's not allowed to do that on conda-forge.

When editing conda-forge.yml and adding:

provider:
  linux_64: ["github_actions"]
  osx_64: ["github_actions"]
  win_64: ["github_actions"]

then regeneration fails because of:

INFO:conda_smithy.configure_feedstock:Applying migrations: /tmp/tmpba_a2ikw/share/conda-forge/migrations/python311.yaml
Traceback (most recent call last):
  File "/home/hadim/local/micromamba/bin/conda-smithy", line 10, in <module>
    sys.exit(main())
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/cli.py", line 670, in main
    args.subcommand_func(args)
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/cli.py", line 486, in __call__
    self._call(args, tmpdir)
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/cli.py", line 491, in _call
    configure_feedstock.main(
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/configure_feedstock.py", line 2289, in main
    render_github_actions(env, config, forge_dir, return_metadata=True)
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/configure_feedstock.py", line 1275, in render_github_actions
    return _render_ci_provider(
  File "/home/hadim/local/micromamba/lib/python3.10/site-packages/conda_smithy/configure_feedstock.py", line 653, in _render_ci_provider
    raise RuntimeError(
RuntimeError: Using github_actions as the CI provider inside conda-forge github org is not allowed in order to avoid a denial of service for other infrastructure.

Also that would only enable the regular GH Actions workers and not the large runners ones for which I think we must pay (that being said it's probably worth putting some money on this, happy to contribute as well).

@hmaarrfk do you think it would be possible to make an exception here by enabling GH Actions as CI only for that repo? That would be only to perform a couple of build experiments and check whether it's worth or not before moving to potentially large runners.

hmaarrfk commented 1 year ago

that being said it's probably worth putting some money on this, happy to contribute as well)

Hmm. I'm not sure how donations are managed. Lets not get side tracked by this conversation here. but maybe you can express your desires in https://github.com/conda-forge/conda-forge.github.io for greater visibility.

do you think it would be possible to make an exception here by enabling GH Actions as CI only for that repo? That would be only to perform a couple of build experiments and check whether it's worth or not before moving to potentially large runners.

you can probably edit out the check in configure_feedstock.py yourself. have you tried that?

h-vetinari commented 1 year ago

that being said it's probably worth putting some money on this, happy to contribute as well)

Hmm. I'm not sure how donations are managed. Lets not get side tracked by this conversation here. but maybe you can express your desires in https://github.com/conda-forge/conda-forge.github.io for greater visibility.

See here. There have been ongoing efforts to get something like this done for well over two years, but there are a lot of moving pieces (not all of them technical) to sort out.

h-vetinari commented 1 year ago

Just saw this recent upstream issue about splitting off a libtorch -- that would be amazing for us. Given the 6h timeout limit, I'd suggest we build this on a different feedstock and then depend on it here.

hmaarrfk commented 10 months ago

I feel like it might be time to try again with pytorch 2.x.... i'm just kinda tired of locking up some of my servers compiling this stuff.

carterbox commented 10 months ago

I've been running build benchmarks recently by piping the build logs through ts. I don't have any results yet, but the ideas I've been playing with are:

Building for major archs only
Trying to speed up linking by using mold instead of ld
Playing with NVCC compile options: specifically --threads which was introduced in CUDA 11.5 and separable compilation

I'm compiling libtorch without python. If I can't get that below 6 hours with 2 cores, then it's still not plausible to build the entire package on the feedstock.

hmaarrfk commented 10 months ago

I would be happy just having to build one or two libraries, to then start a ci job for all the different python packages.

these libraries could be built in a different feedstocks if needed.

carterbox commented 10 months ago

🤔 You are suggesting that you would built libtorch offline (at most 2 archs x 3 platforms x 2 blas x 2 cuda), then the feedstock would build pytorch (at most 4 python x 2 archs x 3 platforms).

platforms - osx, win, linux archs - arm, ppc64le, x86 blas - mkl, openblas cuda - 11.8, 12.0

That makes some sense. Do we already have a feel for how much time it takes to compile libtorch vs the python extension modules? Do the Python extension modules even have a CUDA dependence or do they just link to any libtorch_cuda.so?

hmaarrfk commented 10 months ago

platforms would be limited to at most linux + cuda.

Others seem fine

carterbox commented 10 months ago

Here's the results from my local machine for build-only (no setup including cloning or downloading deps), the build time difference between -DBUILD_PYTHON:BOOL=ON/OFF seems negligible.

On my machine using cmake --parallel 2, CUDA 12.0, and nvcc --threads 2 :

4.50 hours major and minor archs listed in the current recipe
2.75 hours only major archs
2.50 hours only major archs using mold as the linker.

No sure how much slower it will be running in a docker container on the CI.

In summary, the most immediate strategy for reducing build times which is not discussed above should be to prune cuda target archs to major only. This may reduce build times by somewhere between half and a third? Who knows, it might bring build time down to an unreliable 5.9 hours. 😆

As mentioned above, patch work with upstream on the CMakeLists so that pytorch can be build separately from libtorch (in another feedstock) would be probably be helpful too. Since the python specific build time seems negligible, this won't reduce build time for a single variant but should reduce the build matrix and thus build time over all variants.

hmaarrfk commented 10 months ago

I don't really want to have the conversation of the supported architectures in individual feedstocks.

Can we have the discussion a more central location like: https://github.com/conda-forge/cuda-feedstock

Then maybe we can have best practices established.

conda-forge / pytorch-cpu-feedstock

Splitting this package in managable chunks #108

Comment:

Step 1: No more git clone

Option 1: Split off Dependencies:

Option 2 - step 1: Build a libpytorch package or something

Option 2 - step 2: Depend on new ATen/libpytorch package

Compilation time progress