conda-forge / pytorch-cpu-feedstock

A conda-smithy repository for pytorch-cpu.
BSD 3-Clause "New" or "Revised" License

State of pytorch infra? #229

Open baszalmstra opened 3 months ago

baszalmstra commented 3 months ago


Hey dear maintainers and contributors.

Conda is used quite a bit in the ML ecosystem. It's a great option because installing pytorch for your particular system should be as simple as conda install pytorch, which would install a pytorch build targeting your version of CUDA, ROCm, or CPU architecture. I love this!

However, there are some issues that I have been facing.

I see people switching to pip or using the pytorch channel, both of which introduce their own problems.

It looks like a number of these issues are related to infrastructure problems. I would love to contribute to improving this, but I'm not entirely sure where to start, so I'm opening this issue to start a conversation and get in contact with the people who do.

hmaarrfk commented 3 months ago

Which one of the challenges would you like to tackle? In my experience it is going to be nearly impossible to tackle them all at once.

Focus on one area of "need" and work toward it.

If you can get the builds to run on Azure, that is really easy to maintain: we click merge.

I might have dragged my feet on a merge request: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/225#issuecomment-2011052510

I know I'm being "selfish", but co-installability of pytorch and tensorflow is important to me. So I felt like, between the two:

  1. Most updated versions
  2. co-installable tensorflow + pytorch

I chose #2. I could be swayed (I have my own channels that I maintain for this reason), but addressing the "tensorflow problem" is also related to this: https://github.com/conda-forge/tensorflow-feedstock/issues/378

baszalmstra commented 3 months ago

Personally, I think not having windows builds is a big reason to not use pytorch from conda-forge. Even if the version is not completely up to date, not having a version available is worse. ;)

However, I have been building pytorch (-cpu) with rattler-build on Windows and it requires a lot of resources. I think that is also the reason the effort in https://github.com/conda-forge/pytorch-cpu-feedstock/pull/134 was halted?

But I would be happy to revive that PR if things have changed in the meantime?

hmaarrfk commented 3 months ago

> Even if the version is not completely up to date, not having a version available is worse. ;)

I'm not sure that is true. Our migration infrastructure requires all packages to be up to date for all platforms.

So if somebody contributes a windows package one day, then gets pulled away due to other priorities, then pytorch is effectively halted.

rattler likely isn't the cause of the slowdown.

The usage of "git" to clone the large repo is slow. Building for multiple GPU architectures is really problematic.

> But I would be happy to revive that PR if things have changed in the meantime?

Please do! Let's see how far things go. Typically, getting maintainers in sync with contributors (just in terms of time to review) can kill efforts like this.

baszalmstra commented 3 months ago

> I'm not sure that is true. Our migration infrastructure requires all packages to be up to date for all platforms.

No, sorry, I meant that not having the entire feedstock on the latest version is less of a problem than not having a Windows build at all. It's a justification for looking at the Windows build before looking at bumping to the latest version. :)

> rattler likely isn't the cause of the slowdown.

It isn't, but rattler-build makes it easier to iterate on the recipe and build scripts. :)

> Please do!

👍

baszalmstra commented 3 months ago

> The usage of "git" to clone the large repo is slow.

I noticed that pytorch publishes the source (including submodules) as a .tar.gz with every release. Has anyone tried using that instead of git? I did notice it includes symlinks, which might be an issue on Windows.
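For anyone who wants to check the symlink concern, here is a minimal sketch of listing the link entries in a .tar.gz using only Python's standard tarfile module. The archive is built in memory purely for illustration; the member names (demo/README.md and so on) and both helper functions are hypothetical and not taken from the actual pytorch release tarball.

```python
import io
import tarfile


def find_links(archive_bytes: bytes) -> list[tuple[str, str]]:
    """Return (member name, link target) for every symlink or hard link in a .tar.gz."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return [
            (member.name, member.linkname)
            for member in tar.getmembers()
            if member.issym() or member.islnk()
        ]


def build_demo_archive() -> bytes:
    """Build a tiny in-memory .tar.gz with one regular file and one symlink.

    The file names are made up for this example.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello\n"
        regular = tarfile.TarInfo(name="demo/README.md")
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))

        # A symlink entry carries no data, only a target path.
        link = tarfile.TarInfo(name="demo/docs/README.md")
        link.type = tarfile.SYMTYPE
        link.linkname = "../README.md"
        tar.addfile(link)
    return buf.getvalue()


if __name__ == "__main__":
    for name, target in find_links(build_demo_archive()):
        print(f"{name} -> {target}")
```

Running the same find_links function over the bytes of a real downloaded archive would show how many entries are links and where they point, which is a cheap way to judge how much of a problem they would be on Windows before changing the recipe.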

hmaarrfk commented 3 months ago

> I noticed that pytorch publishes the source (including submodule)

This might be new.

This would be an appreciated change independent of windows as currently about 20-25 mins is spent cloning the repo.

PR welcome.