@rgommers that's great - when you provide that update, could you please include information about how the issues you raised earlier in the thread will be addressed, and what outstanding issues you see? It would be really helpful for those of us packaging downstream libs to understand what recommended best practices are, including what we should tell our users about how to update downstream packages.
(In particular, we still haven't managed to create a package that depends on RapidsAI, and can be reliably installed and updated by users. So I guess our biggest unsolved issue is how to be a downstream lib of a downstream lib!...)
@rgommers: @henryiii there is a plan now for a GPU enabled PyTorch package on conda-forge, supported by the PyTorch team. An update on that plan will follow soon.
Any update on this?
PyTorch is now using a conda toolchain to build and test in a Docker image in CI. This was done in the PRs that closed pytorch/pytorch#37584.
I'm trying to pull in the recipe from defaults in https://github.com/conda-forge/pytorch-cpu-feedstock/pull/20
Help would be appreciated.
If you need rights to my repo, please let me know.
I think a good plan would be to:
@soumith I really understand your concern in:
PyTorch is slower than X because of a packaging issue. They simply assume the worst.
However, very recent anecdotal evidence shows that at the 50% performance level, users simply aren't concerned with this kind of penalty so long as they can get their stack installed. The network effect of conda-forge makes it super valuable in getting packages from subfields of machine learning installed, especially for those that defaults and pytorch don't have time to package.
I failed to summarize the long meeting and notes we had on this (apologies), and some things have changed in conda-forge in the meantime that change the details of what we discussed in the call, but the right approach here is still syncing the binaries built by the PyTorch team in the pytorch channel to conda-forge. We're working on this (slowly). Adding CI to PyTorch to build with a conda toolchain (by @mattip) was part of that. Then @scopatz is making the change to cpuonly and gpu mutex packages in https://github.com/pytorch/builder/pull/488, and will work on the next steps for getting PyTorch, including GPU support, onto conda-forge.
@hmaarrfk if you'd want to help out with moving that forward, that seems healthier than keep pushing this package.
However, very recent anecdotal evidence shows that at the 50% performance level, users simply aren't concerned with this kind of penalty so long as they can get their stack installed.
This claim is definitely not true. Some users aren't concerned, but there are a lot of PyTorch bug reports about "torch.<somefunc> is slower now than in older release 1.x.y".
The network effect of conda-forge makes it super valuable in getting packages from subfields of machine learning installed, especially for those that defaults and pytorch don't have time to package.
If you have a set of those, maybe they should simply go in their own channel for the time being, which depends on both the conda-forge and pytorch channels? Name it pytorch-contrib or something like that?
but the right approach here is still syncing the binaries built by the PyTorch team in the pytorch channel to conda-forge
This is certainly not the right approach. I don't see why pytorch is special. We should just build them on conda-forge. A benefit of building them on conda-forge is that we know they are compatible with the rest of the stack. Bots are there to do maintenance and to rebuild against the latest packages. conda-forge also builds on more architectures.
@hmaarrfk, building them on conda-forge is totally fine with me.
This is certainly not the right approach. I don't see why pytorch is special. We should just build them on conda-forge.
You don't even have GPU build hardware, right? There are more reasons, I hope @scopatz can summarize when he gets back; he said the exact same thing as a "conda-forge first principles" type response, but I believe I managed to convince him.
You don't even have GPU build hardware, right?
We have a docker image with the compilers. Hardware is not needed to build AFAIK. After building the package, we can upload to a testing label and then move the package to main after doing testing on a local machine with the hardware.
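A minimal sketch of that label workflow, assuming the anaconda-client CLI and a hypothetical build artifact name:

```sh
# Upload the freshly built package under a non-default "testing" label
anaconda upload --label testing pytorch-1.7.0-py38_cuda102.tar.bz2

# On a machine that actually has the GPU hardware, install from that label
# and run smoke tests there
conda install pytorch -c conda-forge/label/testing

# Once it looks good, re-label the file to "main" on anaconda.org
# (via the web UI or anaconda-client's label management commands)
```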
I mean, the fact that Anaconda has already published conda packages to https://anaconda.org/anaconda/pytorch about 6 months ago mostly invalidates whatever arguments we wish to have about control.
We are quite similar to defaults since a recent sync, so I think it is reasonable to ask that we collaborate instead of diverge.
And for reference, here is a pointer to the installation instructions of the pytorch family package i was talking about https://github.com/pytorch/fairseq#requirements-and-installation
I understand there isn't always an immediate business case (at Facebook or Continuum) to create a high quality package for everything, which is where conda-forge comes in.
We have a docker image with the compilers. Hardware is not needed to build AFAIK. After building the package, we can upload to a testing label and then move the package to main after doing testing on a local machine with the hardware.
Just doing some manual testing seems like a recipe for broken packages. And you probably won't be able to test everything that way (e.g. multi-GPU stuff with torch.distributed). The battery of tests for PyTorch with various hardware and build configs is very large, and it's very common for some things to break that you never saw locally.
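To make that concrete, this is the kind of multi-GPU invocation that is hard to cover without dedicated hardware in CI (the launcher is real PyTorch tooling; smoke_test.py is a hypothetical placeholder):

```sh
# Spawn one worker process per GPU on a 2-GPU node and run a distributed test
python -m torch.distributed.launch --nproc_per_node=2 smoke_test.py
```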
And for reference, here is a pointer to the installation instructions of the pytorch family package i was talking about https://github.com/pytorch/fairseq#requirements-and-installation
That's one package, and it has a "help wanted" issue for a conda package: https://github.com/pytorch/fairseq/issues/1717. Contributing there and getting a first conda package into the pytorch channel seems like a much better idea than doing your own thing. You can then also use the CI system, so you can test the builds, and I'd imagine you get review/help from the fairseq maintainers.
Just doing some manual testing seems like a recipe for broken packages. And you probably won't be able to test everything that way (e.g. multi-GPU stuff with torch.distributed). The battery of tests for PyTorch with various hardware and build configs is very large, and it's very common to have just some things break that you never saw locally.
How is this different from other packages like numpy, openblas, etc.?
How is this different from other packages like numpy, openblas, etc.?
For NumPy you actually run the tests. Example: the v1.19.1 test run: https://travis-ci.com/github/conda-forge/numpy-feedstock/jobs/363588731
Plus the number of ways to build NumPy is far smaller than with PyTorch (e.g., check the number of USE_xxx env vars in PyTorch's setup.py). So, it's very different.
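To illustrate the point about build configurability, a hedged sample of the environment switches PyTorch's setup.py honors (a small, illustrative subset; defaults vary by version):

```sh
export USE_CUDA=1          # build the CUDA backend
export USE_CUDNN=1         # link against cuDNN
export USE_DISTRIBUTED=1   # build torch.distributed
export USE_MKLDNN=1        # oneDNN (MKL-DNN) CPU kernels
export USE_FBGEMM=1        # quantized CPU kernels
export BUILD_TEST=0        # skip the C++ test binaries
export MAX_JOBS=8          # limit parallel compile jobs
python setup.py install
```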
I'm suggesting we run the tests on a local machine with GPU hardware.
We don't test all the code paths in numpy. For example, there are AVX512 code paths that we don't test. We don't test POWER9 code paths. It's impossible to test all code paths.
Plus the number of ways to build NumPy is far smaller than with PyTorch
There are lots of different ways to build openblas. See how many options we set in https://github.com/conda-forge/openblas-feedstock/blob/master/recipe/build.sh#L26-L50
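For comparison, the kind of make variables that script sets (an illustrative subset of standard OpenBLAS options, not a copy of the feedstock's exact flags):

```sh
# DYNAMIC_ARCH=1  -> runtime CPU dispatch instead of a single TARGET
# USE_THREAD=1    -> build the threaded BLAS
# USE_OPENMP=0    -> pthreads threading rather than OpenMP
# NUM_THREADS=128 -> maximum thread count compiled in
# NO_AFFINITY=1   -> do not pin threads to cores
make DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=128 NO_AFFINITY=1
```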
I have to agree that local testing is a poor substitute for a proper CI matrix, but of course that's not possible without a CI having GPUs, see https://github.com/conda-forge/conda-forge.github.io/issues/1062 - considering the impact conda-forge is having on the scientific computing stack in python, one would hope this should be a tractable problem... (note the OP of that issue; it might be possible to hook in self-hosted machines into the regular azure CI).
With a concerted (and a bit more high-level) effort, I believe that it might be realistically possible to convince Microsoft to sponsor the python packaging ecosystem with some GPU CI time on azure, but admittedly, that's just in my head (based on some loose but very positive interaction I had with their reps).
Re: "build coverage" - 100% might not be possible, but one can get pretty close, depending on the invested CI time. For example, even if we can now have 3-4 different CPU builds per platform/python-version/blas-version (via https://github.com/conda/conda/pull/9930), it's still "only" a question of CI time to multiply the matrix of (e.g.) conda-forge/numpy-feedstock#196 by 3-4. For packages as fundamental as numpy/scipy, this is IMO worth the effort. Pytorch could fall into that category as well.
I have to agree that local testing is a poor substitute for a proper CI matrix
How is it different if we run the tests in CI or locally before uploading to the main label?
How is it different if we run the tests in CI or locally before uploading to the main label?
Reproducing a full matrix of combinations (different arches/OSes/Python versions/GPUs/CPUs/etc.) locally is not fundamentally impossible (I just said "poor substitute"), but it would take a huge amount of time (incl. complicated virtualization setups for other OSes/arches), and it would be error-prone and opaque compared to CI builds that run in parallel and can easily be inspected.
Can we please stay on topic? @rgommers wants to copy binaries from the pytorch channel, which is definitely not transparent, nor can it be easily inspected.
If anyone wants to talk more on this, please come to a core meeting.
I'm all for building in conda-forge BTW, just saying that I can see the argument why this shouldn't come at the cost of reduced (GPU-)CI coverage (hence bringing up the GPU-in-CF-CI thing, which would make it possible to kill both birds with one stone).
For what it's worth I am also interested in pytorch on conda-forge (with cuda and no-cuda support). In addition to all the advantages cited above, it would allow compiling against pytorch for conda packages.
Copying binaries is fine by me (I am being pragmatic here), but like probably everyone here I would much prefer to have those packages built directly on conda-forge.
For those of you involved in packaging pytorch, we are interested in pushing https://github.com/conda-forge/pytorch-cpu-feedstock/pull/22 through with the package name pytorch, creating direct competition with the pytorch package advertised on PyTorch's website.
Now that conda-forge supports GPUs, I think it is safe for us to do so.
If there are any other reasons that should be brought up at this stage, please let us know in the PR.
Thanks for all your input so far!
Now that conda-forge supports GPUs
Is the current status documented somewhere? I found https://github.com/conda-forge/conda-forge.github.io/issues/901 as the TODO item to write docs, maybe there's something else?
the pull request #22 is probably the best current documentation on how to use it :D
Hello! I'm from the release engineering team for PyTorch. Please let us know if there's any way we can assist in making the conda-forge installation experience for pytorch as smooth as possible.
cc @malfet
@seemethere, thanks for the offer. One task you could help with is a way to collect the licenses/copyright notices of the third party dependencies to comply with their license terms.
There's not a lot of documentation AFAIK, but https://github.com/conda-forge/goofit-split-feedstock/blob/master/recipe/meta.yaml is an example of a split GPU / CPU package.
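With a split feedstock like that, both variants live in the same channel and users pick one by build string; a hedged sketch (the build-string patterns here are illustrative, not the final naming):

```sh
# Force the CPU-only variant
conda install -c conda-forge "pytorch=*=cpu*"

# Force a CUDA-enabled variant
conda install -c conda-forge "pytorch=*=cuda*"
```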
@seemethere, thanks for the offer. One task you could help with is a way to collect the licenses/copyright notices of the third party dependencies to comply with their license terms.
any updates on this?
Closing this issue as the original issue has been resolved.
I opened #34 to discuss licensing.
@jjhelmus it seems you were able to build pytorch GPU without needing to have variants
https://anaconda.org/anaconda/pytorch/files?version=1.0.1
Is that true?
If so, what challenges do you see in moving this work to conda-forge?