conda-forge / pytorch-cpu-feedstock

A conda-smithy repository for pytorch-cpu.
BSD 3-Clause "New" or "Revised" License

Pytorch 2.4.1, python 3.13, cudnn9 #261

Closed regro-cf-autotick-bot closed 1 month ago

regro-cf-autotick-bot commented 1 month ago

This PR has been triggered in an effort to update python313.

Notes and instructions for merging this PR:

  1. Please merge the PR only after the tests have passed.
  2. Feel free to push to the bot's branch to update this PR if needed.

Please note that if you close this PR we presume that the feedstock has been rebuilt, so if you are going to perform the rebuild yourself, don't close this PR until your rebuild has been merged.


If this PR was opened in error or needs to be updated please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase `@conda-forge-admin, please rerun bot` in a PR comment to have the conda-forge-admin add it for you.

This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. Feel free to drop us a line if there are any issues! This PR was generated by - please use this URL for debugging.

Also includes...

Fixes https://github.com/conda-forge/pytorch-cpu-feedstock/issues/264
Fixes https://github.com/conda-forge/pytorch-cpu-feedstock/pull/259

conda-forge-webservices[bot] commented 1 month ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

hmaarrfk commented 1 month ago

~if anybody could rebuild OSX that would be great!~

Edit: never mind, I forgot about #256 and #259

hmaarrfk commented 1 month ago

cudnn 9 would make this package incompatible with cupy. https://github.com/conda-forge/cupy-feedstock/pull/286

Would it be appropriate to build out 2 versions, one for cudnn 9, the other for cudnn 8???

conda-forge-webservices[bot] commented 1 month ago

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found some lint.

Here's what I've got...

For recipe/meta.yaml:

conda-forge-webservices[bot] commented 1 month ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

hmaarrfk commented 1 month ago

I thought that I could get things to work by downgrading to 2.4.0 instead of 2.4.1... but it still doesn't work.

Tobias-Fischer commented 1 month ago

It looks like it's going ok so far? :)

hmaarrfk commented 1 month ago

No I ran it locally in docker on a linux-64 host and ran into the same problem as https://github.com/conda-forge/pytorch-cpu-feedstock/pull/256

hmaarrfk commented 1 month ago

Alright fingers crossed.

hmaarrfk commented 1 month ago

Ok fingers crossed.

hmaarrfk commented 1 month ago

@leofang I believe I got the export variable correct here. I'll have to check after this build is complete but we can try to ensure we close https://github.com/conda-forge/pytorch-cpu-feedstock/issues/264 as part of this.

hmaarrfk commented 1 month ago

Thank you @h-vetinari for the fixes to the rerendering. I'm trying to get this done before attempting to look at the protobuf issues in https://github.com/conda-forge/pytorch-cpu-feedstock/pull/265; it seems that there are some things to fix up and clean up.

jakirkham commented 1 month ago

Added a few "fixes" notes to the OP based on the discussion here/diff. Please double check that it looks ok

hmaarrfk commented 1 month ago

/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_build_env/bin/../lib/gcc/aarch64-conda-linux-gnu/12.4.0/../../../../aarch64-conda-linux-gnu/bin/ld: error: linker script file '/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/work/cmake/linker_script.ld' appears multiple times

Any ideas @leofang ???

I disabled it for now to ensure that the rest of the build passes.

jakirkham commented 1 month ago

For reference, this is the CI build (also attached log for posterity)

Searching did not turn up many results, but it did find this thread. I wasn't seeing a clear resolution in that thread, though it does indicate someone else ran into this issue (presumably building with their system toolchain). This makes me think there is a bug in PyTorch's build scripts.

jakirkham commented 1 month ago

Different library, but same error ( https://github.com/m01/rock64-arch-linux-build/issues/17 ). In that case, the upstream fix linked is commit ( https://github.com/u-boot/u-boot/commit/d6507e6fd9e0a6f1a8dd28c18cd320c1f861269e ), which removes a duplicated target from a Makefile

So is there some target that is being rebuilt multiple times?

hmaarrfk commented 1 month ago

Is this running in emulation mode? The build is taking a really long time...

jakirkham commented 1 month ago

Think the configuration needs a tweak. Noted above: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/261#discussion_r1786911294

Edit: Also a re-render after ( which now can be done with the bot https://github.com/conda-forge/admin-requests/pull/1098 ! 🥳 )

leofang commented 1 month ago

So is there some target that is being rebuilt multiple times?

It seems to be the linker script being passed to the linker twice

  --   Shared LD flags       : -Wl,-O2 -Wl,--sort-common -Wl,-z,relro -Wl,-z,lazy -Wl,--allow-shlib-undefined -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/targets/sbsa-linux/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/targets/sbsa-linux/lib/stubs -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_build_env/targets/sbsa-linux/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_build_env/targets/sbsa-linux/lib/stubs -T/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/work/cmake/linker_script.ld -Wl,--no-as-needed -Wl,-O2 -Wl,--sort-common -Wl,-z,relro -Wl,-z,lazy -Wl,--allow-shlib-undefined -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/targets/sbsa-linux/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pl/targets/sbsa-linux/lib/stubs -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_build_env/targets/sbsa-linux/lib -L/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/_build_env/targets/sbsa-linux/lib/stubs 
-T/home/conda/feedstock_root/build_artifacts/libtorch_1727958353053/work/cmake/linker_script.ld -rdynamic

I can dig around. There are too many changes in flux, so it may not be easily isolated.
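
As a quick sanity check against a dump like the one above, here is a tiny sketch that counts how often the generated script appears on the link line (the flag string below is abbreviated by hand, not read from the real log):

```python
# Abbreviated stand-in for the "Shared LD flags" line CMake printed above;
# nothing here is read from the actual build log.
shared_ld_flags = (
    "-Wl,-O2 -Wl,--sort-common "
    "-T/work/cmake/linker_script.ld -Wl,--no-as-needed "
    "-Wl,-O2 -Wl,--sort-common "
    "-T/work/cmake/linker_script.ld -rdynamic"
)

script_flags = [tok for tok in shared_ld_flags.split()
                if tok.startswith("-T") and tok.endswith("linker_script.ld")]
print(f"linker script passed {len(script_flags)} time(s)")  # -> 2, as in the log
```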

hmaarrfk commented 1 month ago

There are too many changes in flux, so it may not be easily isolated.

Ok, let me try to get this new image built out on alma, and we can try the linker optimization in a separate PR.

leofang commented 1 month ago

Ok, let me try to get this new image built out on alma, and we can try the linker optimization in a separate PR.

Yes plz do. I can also try to build the recipe locally and poke around next week, or just ask my colleague if this line is redundant and the root cause of the linker flag being passed twice: https://github.com/pytorch/pytorch/blob/a93d3873e97973fbc0009245579ee4e4fa7f155a/setup.py#L1173

jakirkham commented 1 month ago

It seems to be the linker script being passed to the linker twice

This is almost certainly the issue

The harder question is why that is happening

Edit: Missed the last bit. That setup.py doesn't look right. Agree we should patch that out. Though it may be covering up deeper configuration issues. This would benefit from an upstream issue/PR at the end to clean up this stuff.
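
For what it's worth, here is a hedged sketch of one plausible mechanism for the doubling, consistent with the flag dump above but not a verified trace of PyTorch's build: setup.py appends the `-T` option to `LDFLAGS`, CMake seeds its shared-linker flags from `LDFLAGS` on the first configure, and the `if(LINUX)` block in PyTorch's CMakeLists.txt then appends `$ENV{LDFLAGS}` on top of the already-seeded flags.

```python
import os

# Hedged sketch of the suspected interaction, not PyTorch's real code path.
# setup.py appends the generated script to LDFLAGS...
os.environ["LDFLAGS"] = os.environ.get("LDFLAGS", "") + " -T/work/cmake/linker_script.ld"

# ...CMake initializes CMAKE_SHARED_LINKER_FLAGS from LDFLAGS on first configure...
cmake_shared_linker_flags = os.environ["LDFLAGS"]

# ...and CMakeLists.txt appends $ENV{LDFLAGS} again on Linux.
cmake_shared_linker_flags += " -Wl,--no-as-needed " + os.environ["LDFLAGS"]

print(cmake_shared_linker_flags.count("linker_script.ld"))  # -> 2
```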

hmaarrfk commented 1 month ago

PS. I am trying to upstream whatever patches I find: https://github.com/pytorch/pytorch/pull/137084

so when we do have a suggestion, we should definitely make it ;)

hmaarrfk commented 1 month ago

Well, the build seems to be progressing, so let's let it run. We can unskip and merge in 10-12 hours.

hmaarrfk commented 1 month ago

Fixes to the LD script definitely welcome in that time ;)

hmaarrfk commented 1 month ago

Bad patch that didn't work. Got through the compilation check with:

```
$ git diff -- setup.py CMakeLists.txt
diff --git a/CMakeLists.txt b/CMakeLists.txt
index c4cd4b2c2..869b4f9d5 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,6 +25,8 @@ endif()
 # ---[ Project and semantic versioning.
 project(Torch CXX C)
 
+set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -T${CMAKE_SOURCE_DIR}/cmake/linker_script.ld")
+
 if(${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
   set(LINUX TRUE)
 else()
diff --git a/setup.py b/setup.py
index 499e0b06e..f84b193b9 100644
--- a/setup.py
+++ b/setup.py
@@ -1152,7 +1152,7 @@ def main():
         filein="cmake/prioritized_text.txt", fout="cmake/linker_script.ld"
     )
     linker_script_path = os.path.abspath("cmake/linker_script.ld")
-    os.environ["LDFLAGS"] = os.getenv("LDFLAGS", "") + f" -T{linker_script_path}"
+    # os.environ["LDFLAGS"] = os.getenv("LDFLAGS", "") + f" -T{linker_script_path}"
     os.environ["CFLAGS"] = (
         os.getenv("CFLAGS", "") + " -ffunction-sections -fdata-sections"
     )
```

But I'll have to inspect the logs after completion to make sure the linker script got used.
hmaarrfk commented 1 month ago

Wow, I spoke way too soon...

  /home/conda/feedstock_root/build_artifacts/debug_1728005337762/_build_env/bin/../lib/gcc/aarch64-conda-linux-gnu/12.4.0/../../../../aarch64-conda-linux-gnu/bin/ld: cannot represent machine `i386:x86-64'
hmaarrfk commented 1 month ago
diff --git a/tools/setup_helpers/generate_linker_script.py b/tools/setup_helpers/generate_linker_script.py
index 11c397a9e..32f817fba 100644
--- a/tools/setup_helpers/generate_linker_script.py
+++ b/tools/setup_helpers/generate_linker_script.py
@@ -1,4 +1,5 @@
 import subprocess
+import os

 def gen_linker_script(
@@ -10,7 +11,8 @@ def gen_linker_script(
             line.replace("\n", "") for line in prioritized_text if line != "\n"
         ]

-    linker_script_lines = subprocess.check_output(["ld", "-verbose"], text=True).split(
+    ld = os.environ.get('LD', 'ld')
+    linker_script_lines = subprocess.check_output([ld, "-verbose"], text=True).split(
         "\n"
     )

Probably a better patch candidate.
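
For context on why honoring `$LD` matters: `ld -verbose` prints the default linker script of whatever target that linker was built for, so calling the host `ld` during a cross build yields an x86-64 script that the aarch64 linker then rejects with the "cannot represent machine `i386:x86-64'" error above. A minimal sketch of exercising the patched lookup (the target linker name mentioned in the comment is an assumption, not taken from the logs):

```python
import os
import subprocess

# On a conda-forge cross build the compiler activation scripts export LD for
# the target toolchain (e.g. something like "aarch64-conda-linux-gnu-ld");
# the exact binary name is an assumption here.
ld = os.environ.get("LD", "ld")

# `<ld> -verbose` dumps that linker's built-in default script, which is what
# gen_linker_script() reorders using cmake/prioritized_text.txt. Calling the
# host "ld" instead yields an x86-64 script, hence the error above.
default_script = subprocess.check_output([ld, "-verbose"], text=True)
print(default_script.splitlines()[0])  # linker identification line
```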

jakirkham commented 1 month ago

Thanks Mark! 🙏

Yeah was wondering if that linker script line was covering a deeper issue

Think we will need to take a closer look

Edit: Didn't see your new patch. Worth a try

hmaarrfk commented 1 month ago

I tried; I get back to the fact that the linker script is included twice. (I'm in a debug session.)

hmaarrfk commented 1 month ago

With 30 cores (a Threadripper 2950, so they are really wimpy) you can get a failure at file 920/6271 after changing the linker invocation to use the environment variable.

hmaarrfk commented 1 month ago

Fingers crossed this does it:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index c4cd4b2c2..3d4243179 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -532,17 +532,6 @@ endif()
 option(TRACING_BASED
        "Master flag to build Lite Interpreter with tracing build option" OFF)
 option(BUILD_EXECUTORCH "Master flag to build Executorch" ON)
-# This is a fix for a rare build issue on Ubuntu: symbol lookup error:
-# miniconda3/envs/pytorch-py3.7/lib/libmkl_intel_lp64.so: undefined symbol:
-# mkl_blas_dsyrk
-# https://software.intel.com/en-us/articles/symbol-lookup-error-when-linking-intel-mkl-with-gcc-on-ubuntu
-if(LINUX)
-  set(CMAKE_SHARED_LINKER_FLAGS
-      "${CMAKE_SHARED_LINKER_FLAGS} -Wl,--no-as-needed")
-  set(CMAKE_SHARED_LINKER_FLAGS
-      "${CMAKE_SHARED_LINKER_FLAGS} $ENV{LDFLAGS}")
-endif()
-
 if(MSVC)
   # MSVC by default does not apply the correct __cplusplus version as specified
   # by the C++ standard because MSVC is not a completely compliant
diff --git a/tools/setup_helpers/generate_linker_script.py b/tools/setup_helpers/generate_linker_script.py
index 11c397a9e..9009604f2 100644
--- a/tools/setup_helpers/generate_linker_script.py
+++ b/tools/setup_helpers/generate_linker_script.py
@@ -1,4 +1,5 @@
 import subprocess
+import os

 def gen_linker_script(
@@ -10,7 +11,8 @@ def gen_linker_script(
             line.replace("\n", "") for line in prioritized_text if line != "\n"
         ]

-    linker_script_lines = subprocess.check_output(["ld", "-verbose"], text=True).split(
+    ld = os.environ.get("LD", "ld")
+    linker_script_lines = subprocess.check_output([ld, "-verbose"], text=True).split(
         "\n"
     )

seems to have gotten to 1461+

hmaarrfk commented 1 month ago

Their fix is included in https://github.com/pytorch/pytorch/commit/f217b470cc7ebacc62c8e87dbab8c4894d53e9b9

hmaarrfk commented 1 month ago

Well, feel free to look at the aarch64 logs: https://github.com/conda-forge/pytorch-cpu-feedstock/actions/runs/11173834884/job/31062549907

other linux: https://github.com/conda-forge/pytorch-cpu-feedstock/actions/runs/11147217363/job/30981011412

other osx: https://github.com/conda-forge/pytorch-cpu-feedstock/runs/30993778983

jakirkham commented 1 month ago

It looks like the ARM CPU build on CI (attached log) had this error

CMake Error: The current CMakeCache.txt directory /home/conda/feedstock_root/build_artifacts/libtorch_1728060831124/work/build/CMakeCache.txt is different than the directory /home/conda/feedstock_root/build_artifacts/libtorch_172806083.94/work/build where CMakeCache.txt was created. This may result in binaries being created in the wrong place. If you are not sure, reedit the CMakeCache.txt
isuruf commented 1 month ago

Lol. That's a funny bug.

hmaarrfk commented 1 month ago

Oh, I was doing abbreviated builds so I missed it... thanks Isuru!!!

isuruf commented 1 month ago

You need a timestamp with 3*12 to hit the bug. :smile:
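
To illustrate the failure mode: the actual sed invocation is not quoted in this thread, so the pattern below is a hypothetical reconstruction that is merely consistent with the hint above and with the mangled `libtorch_172806083.94` path in the CMake error, where an unescaped dot also matches digits inside the epoch timestamp of the build prefix.

```python
import re

# Hypothetical reconstruction, not the feedstock's actual sed command: an
# unescaped "." in a pattern like "3.12" also matches "3112" inside the
# epoch-timestamped build directory name.
path = "/home/conda/feedstock_root/build_artifacts/libtorch_1728060831124/work/build"
print(re.sub(r"3.12", "3.9", path))
# -> .../libtorch_172806083.94/work/build, which no longer matches the
#    directory recorded in CMakeCache.txt, producing the CMake error above.
```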

hmaarrfk commented 1 month ago

Wow, ok... yeah, I guess sed is bad... we could switch to their new workflow where they claim to be able to cleanly split the library and non-library portions.

jakirkham commented 1 month ago

Am seeing the following error in this CI job (with attached log):

-- Using third party subdirectory Eigen.
CMake Error at /home/conda/feedstock_root/build_artifacts/libtorch_1728075572459/_build_env/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Python (missing: Python_INCLUDE_DIRS Interpreter
  Development.Module NumPy) (found version "3.11.10")
Call Stack (most recent call first):
  /home/conda/feedstock_root/build_artifacts/libtorch_1728075572459/_build_env/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  /home/conda/feedstock_root/build_artifacts/libtorch_1728075572459/_build_env/share/cmake-3.30/Modules/FindPython.cmake:673 (find_package_handle_standard_args)
  cmake/Dependencies.cmake:857 (find_package)
  CMakeLists.txt:857 (include)

Edit: Another job fails with the same error

hmaarrfk commented 1 month ago

It's probably because I hardcoded the numpy include dir and it's frozen in.

jakirkham commented 1 month ago

What if we added a symlink to the NumPy headers using a Python version agnostic location and provided the latter as input?
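
A rough sketch of that idea, assuming it were wired into the build script; the directory name, environment variables, and the way the path would be handed to CMake are all illustrative rather than what the recipe currently does:

```python
import os
import numpy as np

# Illustrative only: expose the NumPy headers at a path that does not embed
# the Python version, so a hard-coded include dir stays valid across pythons.
versioned_include = np.get_include()  # e.g. .../pythonX.Y/site-packages/numpy/_core/include
agnostic_include = os.path.join(os.environ.get("PREFIX", "/tmp/prefix"), "include", "numpy-agnostic")

os.makedirs(os.path.dirname(agnostic_include), exist_ok=True)
if not os.path.lexists(agnostic_include):
    os.symlink(versioned_include, agnostic_include)

# The agnostic path would then be what gets passed to CMake's Python/NumPy
# discovery instead of the versioned one.
print(agnostic_include, "->", versioned_include)
```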

hmaarrfk commented 1 month ago

You are welcome to try; I can back off.

hmaarrfk commented 1 month ago

Actually, my guess is that cmake is invoked through a new code path I didn't capture in my patch.

jakirkham commented 1 month ago

I trust your judgment Mark. Please let me know if I'm just missing things

Edit: Fixed a typo

hmaarrfk commented 1 month ago

I’m too tired for today. Just giving ideas.

hmaarrfk commented 1 month ago

I think it is a difference between numpy 2.0 and 2.1.

Is there a reason to build with 2.0 vs 2.1?

The reason I ask is that the migration pins to 2.0: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/main/recipe/migrations/numpy2.yaml#L48

h-vetinari commented 1 month ago

Is there a reason to build with 2.0 vs 2.1?

Just because we started out conservatively. You can use 2.1 without issue, or pin to 2 directly.

hmaarrfk commented 1 month ago

I'm not so convinced about my 2.1 theory anymore. I had successful builds with linux-64 and aarch64 in the past...

hmaarrfk commented 1 month ago

@isuruf it seems the sed "fix" was the source of the issue. Any other ideas on how to make the sed more robust? Or should we just restart builds periodically?