Hi! This is the friendly automated conda-forge-linting service.
I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.
@conda-forge-admin, please rerender
GCC 10 builds fail due to OneDNN:
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1686968646633/work/torch/csrc/jit/codegen/onednn/graph_helper.h:3:10: fatal error: oneapi/dnnl/dnnl_graph.hpp: No such file or directory
3 | #include <oneapi/dnnl/dnnl_graph.hpp>
|
OneDNN does not seem to be listed in the build requirements: https://github.com/conda-forge/pytorch-cpu-feedstock/blob/f2474772cb070bd3131bbb349e0d6f856c79747a/recipe/meta.yaml
GCC 12 builds fail because of an incompatibility between FBGEMM and GCC 12: https://github.com/pytorch/pytorch/issues/77939 https://github.com/pytorch/FBGEMM/issues/1666 Apparently the error is a false positive and can be ignored. Users got around it by just silencing maybe-uninitialized (this is now recommended in FBGEMM's README):
export CXXFLAGS+=' -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'
export CFLAGS+=' -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'
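For the feedstock itself, a minimal sketch of how that workaround could be wired into the build script (the GCC version probe and the build_pytorch.sh location are my assumptions, not something taken from the failing logs):
# Sketch: only silence the (reportedly false-positive) warnings when building with GCC >= 12.
# "${CXX:-g++}" is whatever C++ compiler the conda-forge toolchain activates.
gcc_major=$("${CXX:-g++}" -dumpversion | cut -d. -f1)
if [[ ${gcc_major} -ge 12 ]]; then
    export CXXFLAGS="${CXXFLAGS} -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull"
    export CFLAGS="${CFLAGS} -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull"
fi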
I wanna help, but I do not really know how a feedstock works -.-
Thanks, I will follow up with your suggested fixes soon.
We all learn the conda-forge ways by watching.
Hi! This is the friendly automated conda-forge-webservice.
I tried to rerender for you, but it looks like there was nothing to do.
This message was generated by GitHub actions workflow run https://github.com/conda-forge/pytorch-cpu-feedstock/actions/runs/5315469498.
We still have the onednn issue. Let me think about how we address this...
Inspecting the first lines of the failed builds:
## Package Plan ##
environment location: /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687227140059/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh
The following NEW packages will be INSTALLED:
_libgcc_mutex: 0.1-conda_forge conda-forge
_openmp_mutex: 4.5-2_kmp_llvm conda-forge
brotli: 1.0.9-h166bdaf_8 conda-forge
brotli-bin: 1.0.9-h166bdaf_8 conda-forge
bzip2: 1.0.8-h7f98852_4 conda-forge
ca-certificates: 2023.5.7-hbcca054_0 conda-forge
certifi: 2023.5.7-pyhd8ed1ab_0 conda-forge
charset-normalizer: 3.1.0-pyhd8ed1ab_0 conda-forge
cuda-version: 11.8-h70ddcb2_2 conda-forge
cudatoolkit: 11.8.0-h37601d7_11 conda-forge
cudnn: 8.8.0.121-h0800d71_1 conda-forge
future: 0.18.3-pyhd8ed1ab_0 conda-forge
icu: 72.1-hcb278e6_0 conda-forge
idna: 3.4-pyhd8ed1ab_0 conda-forge
ld_impl_linux-64: 2.40-h41732ed_0 conda-forge
libblas: 3.9.0-16_linux64_mkl conda-forge
libbrotlicommon: 1.0.9-h166bdaf_8 conda-forge
libbrotlidec: 1.0.9-h166bdaf_8 conda-forge
libbrotlienc: 1.0.9-h166bdaf_8 conda-forge
libcblas: 3.9.0-16_linux64_mkl conda-forge
libffi: 3.4.2-h7f98852_5 conda-forge
libgcc-ng: 13.1.0-he5830b7_0 conda-forge
libgomp: 13.1.0-he5830b7_0 conda-forge
libhwloc: 2.9.1-nocuda_h7313eea_6 conda-forge
libiconv: 1.17-h166bdaf_0 conda-forge
liblapack: 3.9.0-16_linux64_mkl conda-forge
libmagma: 2.7.1-hc72dce7_3 conda-forge
libmagma_sparse: 2.7.1-hc72dce7_4 conda-forge
libnsl: 2.0.0-h7f98852_0 conda-forge
libprotobuf: 3.21.12-h3eb15da_0 conda-forge
libsqlite: 3.42.0-h2797004_0 conda-forge
libstdcxx-ng: 13.1.0-hfd8a6a1_0 conda-forge
libuuid: 2.38.1-h0b41bf4_0 conda-forge
libuv: 1.44.2-h166bdaf_0 conda-forge
libxml2: 2.11.4-h0d562d8_0 conda-forge
libzlib: 1.2.13-hd590300_5 conda-forge
llvm-openmp: 16.0.6-h4dfa4b3_0 conda-forge
magma: 2.7.1-ha770c72_4 conda-forge
mkl: 2022.2.1-h84fe81f_16997 conda-forge
mkl-devel: 2022.2.1-ha770c72_16998 conda-forge
mkl-include: 2022.2.1-h84fe81f_16997 conda-forge
nccl: 2.18.3.1-h12f7317_0 conda-forge
ncurses: 6.4-hcb278e6_0 conda-forge
numpy: 1.21.6-py310h45f3432_0 conda-forge
openssl: 3.1.1-hd590300_1 conda-forge
pip: 23.1.2-pyhd8ed1ab_0 conda-forge
pkg-config: 0.29.2-h36c2ea0_1008 conda-forge
pysocks: 1.7.1-pyha2e5f31_6 conda-forge
python: 3.10.11-he550d4f_0_cpython conda-forge
python_abi: 3.10-3_cp310 conda-forge
pyyaml: 6.0-py310h5764c6d_5 conda-forge
readline: 8.2-h8228510_1 conda-forge
requests: 2.31.0-pyhd8ed1ab_0 conda-forge
setuptools: 67.7.2-pyhd8ed1ab_0 conda-forge
six: 1.16.0-pyh6c4a22f_0 conda-forge
sleef: 3.5.1-h9b69904_2 conda-forge
tbb: 2021.9.0-hf52228f_0 conda-forge
tk: 8.6.12-h27826a3_0 conda-forge
typing: 3.10.0.0-pyhd8ed1ab_0 conda-forge
typing_extensions: 4.6.3-pyha770c72_0 conda-forge
tzdata: 2023c-h71feb2d_0 conda-forge
urllib3: 2.0.3-pyhd8ed1ab_0 conda-forge
wheel: 0.40.0-pyhd8ed1ab_0 conda-forge
xz: 5.2.6-h166bdaf_0 conda-forge
yaml: 0.2.5-h7f98852_2 conda-forge
zstd: 1.5.2-h3eb15da_6 conda-forge
OneDNN is not there
I see you added mkl-include to meta.yaml, but this is a different package with different headers and names. PyTorch has the -DUSE_MKLDNN CMake flag to use MKL-DNN instead of the newer oneAPI, but there has been an effort to migrate to oneAPI in PyTorch: https://github.com/pytorch/pytorch/pull/32422 So I guess we should use that. The conda-forge package is just "onednn" and includes the missing header.
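If it helps, a quick way to sanity-check that (just a sketch; the environment path is arbitrary):
# Sketch: confirm that conda-forge's onednn package ships the header the build is missing.
mamba create -y -p ./onednn-probe -c conda-forge onednn
ls ./onednn-probe/include/oneapi/dnnl/dnnl_graph.hpp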
I tried adding onednn in https://github.com/conda-forge/pytorch-cpu-feedstock/pull/172/commits/48b28139fb57276fe18212a425c4b677888e7300 and it didn't work. I got the mkl-include idea from looking at how pytorch builds their own conda package (no onednn, but they use mkl-include).
What was the error when you added onednn? I cannot see the log.
My plan was to add onednn again if mkl-include didn't work (i.e., adding one piece at a time)
The compiler finds oneDNN, but eventually it fails with this error:
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687262527413/work/torch/csrc/jit/codegen/onednn/operator.h:98:15: error: no matching function for call to 'dnnl::graph::op::set_attr(std::string&, std::__cxx11::basic_string<char>)'
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687262527413/work/torch/csrc/jit/codegen/onednn/operator.h:98:16: note: cannot convert 'name' (type 'std::string' {aka 'std::__cxx11::basic_string<char>'}) to type 'dnnl::graph::op::attr'
It looks to me like a version mismatch between oneDNN and pytorch. Honestly no idea how to solve it -.- Sadly I cannot find any references to it online.
These official tips are 4 years old, but it's the best I could find. https://github.com/pytorch/pytorch/blame/main/CONTRIBUTING.md#c-development-tips
EDIT: I found these: https://github.com/pytorch/pytorch/pull/103745 https://github.com/pytorch/pytorch/pull/97957 Which makes me think that onednn should be included in third_party automatically. My head hurts -.-
That was gonna be the next question --- why didn't we encounter this problem before? I am not sure what else is changing, but on our end, it seems nothing changed since the last build. Is there anything that could be related in here: https://github.com/pytorch/pytorch/compare/v2.0.0...v2.0.1
We can either:
Do you have a preference?
As far as I can tell there is nothing in that diff messing with the building process.
I would use the older DNN
OTOH the CPU builds work, and I believe it is due to this USE_MKLDNN here: https://github.com/conda-forge/pytorch-cpu-feedstock/blob/6b2d2b8b4fea88916ff6c9aef810216f0bb666e8/recipe/build_pytorch.sh#L128-L139 That branch only runs when the cuda compiler is None, so maybe that is why oneDNN is not even needed there.
Smart! Let's add MKLDNN to the cuda ones too. I will edit the PR with this first, then the older onednn if it doesn't work.
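For the record, the proposed change is essentially this in recipe/build_pytorch.sh (sketch):
# Sketch: enable MKL-DNN for the CUDA builds too, mirroring the CPU-only branch.
if [[ ${cuda_compiler_version:-None} != "None" ]]; then
    export USE_MKLDNN=1
fi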
Btw, do you know if they added support for Grace Hopper GPUs yet? If so, we may need to edit the compute list we have (TORCH_CUDA_ARCH_LIST). I can double check once everything passes.
That is above my pay grade, dude. I only get up to 4090's, not complaining, though :P I learned how to build locally and tried setting the MKLDNN flag; it still failed with the same error. I am now trying again but also removing onednn as a dependency in meta.yaml. Perhaps it is interfering somehow. I checked and pytorch comes with onednn via the submodule "ideep": https://github.com/intel/ideep/tree/fe8378249600442043b98f333b8b605bedca5a25. It says mkl-dnn, but if you click it redirects you to onednn. Fun stuff.
Also, I believe your diff is not correct. pytorch@v2.0.0 points to the tag v2.0.0-rc6, which I bet is more recent than the tag that was used to build current v2.0.0, correct me if I am wrong. Judging by the dates in the feedstock we should be comparing these: https://github.com/pytorch/pytorch/compare/v2.0.0-rc4...v2.0.1
If this keeps failing, I will consider moving to testing cuda 12. Let me know if you can find anything useful in the diffs. I will take care of the newer GPUs stuff.
Building locally with MKLDNN=1 and removing onednn from dependencies yields the same error:
In file included from /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687351558668/work/torch/csrc/jit/codegen/onednn/graph_fuser.h:3,
from /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687351558668/work/torch/csrc/jit/codegen/onednn/graph_fuser.cpp:1:
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1687351558668/work/torch/csrc/jit/codegen/onednn/graph_helper.h:3:10: fatal error: oneapi/dnnl/dnnl_graph.hpp: No such file or directory
3 | #include <oneapi/dnnl/dnnl_graph.hpp>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
The file is located at third_party/ideep/mkl-dnn/include/oneapi/dnnl/dnnl_graph.hpp. However, its directory is not on the include path (-I) used to compile that file:
$ cat log.txt | grep "graph_fuser.cpp" | grep "c++" | tail -1 | tr ' ' '\n' | grep '\-I' | cut -d/ -f8-
build/aten/src
aten/src
build
third_party/onnx
build/third_party/onnx
third_party/foxi
build/third_party/foxi
torch/csrc/api
torch/csrc/api/include
caffe2/aten/src/TH
build/caffe2/aten/src/TH
build/caffe2/aten/src
build/caffe2/../aten/src
torch/csrc
third_party/miniz-2.1.0
third_party/kineto/libkineto/include
aten/../third_party/catch/single_include
aten/src/ATen/..
third_party/FXdiv/include
c10/..
third_party/pthreadpool/include
third_party/cpuinfo/include
third_party/QNNPACK/include
aten/src/ATen/native/quantized/cpu/qnnpack/include
aten/src/ATen/native/quantized/cpu/qnnpack/src
third_party/cpuinfo/deps/clog/include
third_party/NNPACK/include
third_party/fbgemm/include
third_party/fbgemm
third_party/fbgemm/third_party/asmjit/src
third_party/FP16/include
third_party/tensorpipe
build/third_party/tensorpipe
third_party/tensorpipe/third_party/libnop/include
third_party/fmt/include
third_party/flatbuffers/include
The CMake output warns about this:
-- Will build oneDNN Graph
-- MKLDNN source files not found!
CMake Warning at cmake/Dependencies.cmake:1762 (message):
MKLDNN could not be found.
Call Stack (most recent call first):
CMakeLists.txt:717 (include)
The files are there in my local build_artifact and the FindMKLDNN.cmake file looks correct to me:
IF(NOT MKLDNN_FOUND)
SET(MKLDNN_LIBRARIES)
SET(MKLDNN_INCLUDE_DIR)
SET(IDEEP_ROOT "${PROJECT_SOURCE_DIR}/third_party/ideep")
SET(MKLDNN_ROOT "${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/third_party/oneDNN")
IF(NOT APPLE AND NOT WIN32 AND NOT BUILD_LITE_INTERPRETER)
MESSAGE("-- Will build oneDNN Graph")
SET(LLGA_ROOT "${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn")
SET(BUILD_ONEDNN_GRAPH ON)
ENDIF(NOT APPLE AND NOT WIN32 AND NOT BUILD_LITE_INTERPRETER)
FIND_PACKAGE(BLAS)
FIND_PATH(IDEEP_INCLUDE_DIR ideep.hpp PATHS ${IDEEP_ROOT} PATH_SUFFIXES include)
FIND_PATH(MKLDNN_INCLUDE_DIR dnnl.hpp dnnl.h PATHS ${MKLDNN_ROOT} PATH_SUFFIXES include)
IF(NOT MKLDNN_INCLUDE_DIR)
EXECUTE_PROCESS(COMMAND git${CMAKE_EXECUTABLE_SUFFIX} submodule update --init mkl-dnn WORKING_DIRECTORY ${IDEEP_ROOT})
FIND_PATH(MKLDNN_INCLUDE_DIR dnnl.hpp dnnl.h PATHS ${MKLDNN_ROOT} PATH_SUFFIXES include)
ENDIF(NOT MKLDNN_INCLUDE_DIR)
IF(BUILD_ONEDNN_GRAPH)
FIND_PATH(LLGA_INCLUDE_DIR oneapi/dnnl/dnnl_graph.hpp PATHS ${LLGA_ROOT} PATH_SUFFIXES include)
ENDIF(BUILD_ONEDNN_GRAPH)
IF(NOT IDEEP_INCLUDE_DIR OR NOT MKLDNN_INCLUDE_DIR)
MESSAGE(STATUS "MKLDNN source files not found!")
RETURN()
ENDIF(NOT IDEEP_INCLUDE_DIR OR NOT MKLDNN_INCLUDE_DIR)
The dnnl.hpp files it is looking for are definitely there. And even if the submodule was not there for some reason, it tries to update it right there. The plot thickens...
Maybe we can set MKLDNN_INCLUDE_DIR?
Also, looks like the CI has gone further now? How are you building locally btw? Are you using the build_locally.py script?
Yes, I am using the build_locally.py script. It's fast because you can build only one thing and I can set MAX_JOBS to compile in parallel. Sucks because it clones pytorch every time -.-
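For reference, roughly how I drive it (a sketch; the config name is just an example, pick one from .ci_support/, and the script in the feedstock root is spelled build-locally.py):
# Sketch: build a single CI configuration locally with parallel compilation.
export MAX_JOBS=$(nproc)    # picked up by PyTorch's setup.py
python build-locally.py linux_64_cuda_compiler_versionNone   # example config name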
Maybe we can set MKLDNN_INCLUDE_DIR?
Perhaps by replacing the pip build command with something like:
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py build --cmake-only
cmake build -DTHIS_AND_THAT
cd build && make all install
Otherwise AFAIK you cannot influence CMake variables.
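A concrete (untested) instance of that idea, using the vendored paths that FindMKLDNN.cmake searches above; every path here is an assumption based on that snippet:
# Hypothetical: configure once with --cmake-only, inject the include dirs that
# FindMKLDNN.cmake failed to locate, then finish the build by hand.
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py build --cmake-only
cmake build \
    -DMKLDNN_INCLUDE_DIR="${PWD}/third_party/ideep/mkl-dnn/third_party/oneDNN/include" \
    -DLLGA_INCLUDE_DIR="${PWD}/third_party/ideep/mkl-dnn/include"
cd build && make all install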
Anyhow it looks like the CI was able to compile the offending files; it's almost finished actually. Maybe we are lucky and that's it...
Don't hold your breath! The cuda builds won't finish in 6 hours. What we are hoping for here is that they time out...
Well damn... they're finishing in time. Could you test one of the gpu artifacts to see if they work fine locally? Let me know if you don't know how to do that.
@conda-forge/pytorch-cpu @hmaarrfk @h-vetinari @Tobias-Fischer, since when are we able to build cuda112 on the CI here??? This can't be right!
Could you please have a look? Would you like me to incorporate the cuda12 migration in here too?
@RaulPPelaez I will add you as a coauthor to commits in this PR if you don't mind. Please object if you don't want me to do that.
I should have checked the logs more carefully:
-- Could NOT find CUDA (missing: CUDA_INCLUDE_DIRS) (found version "11.2")
CMake Warning at cmake/public/cuda.cmake:31 (message):
Caffe2: CUDA cannot be found. Depending on whether you are building Caffe2
or a Caffe2 dependent library, the next warning / error will give you more
info.
Call Stack (most recent call first):
cmake/Dependencies.cmake:43 (include)
CMakeLists.txt:717 (include)
CMake Warning at cmake/Dependencies.cmake:66 (message):
Not compiling with CUDA. Suppress this warning with -DUSE_CUDA=OFF.
Call Stack (most recent call first):
CMakeLists.txt:717 (include)
The local build looks fine to me. The docker container appears to provide CUDA in /usr/local/cuda. I do not see that message about "CUDA not found". The only difference between your last commit and my local version is that I did not introduce this line in the CUDA branch of the build script:
export CMAKE_TOOLCHAIN_FILE="${RECIPE_DIR}/cross-linux.cmake"
I don't know how to fix the problem, but we should check whether we can catch this somehow in a test case. It's problematic that the build still passes.
In my local build, adding the CMAKE_TOOLCHAIN_FILE makes CMake not pick CUDA up. By removing it the error goes away.
Not sure how to go about catching this particular mistake.
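One way this might be caught in the test phase (just a sketch, not something the recipe does today): torch.version.cuda is None in CPU-only builds, so a CUDA variant that silently fell back to CPU would fail this check even on a CI machine without a GPU.
# Hypothetical guard for the recipe's test section: fail fast if a "cuda" build
# variant was compiled without CUDA support.
if [[ "${cuda_compiler_version:-None}" != "None" ]]; then
    python -c "import torch, sys; sys.exit(0 if torch.version.cuda else 1)"
fi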
What's the status of this? Confusingly for me, I see pytorch 2.0.1 when I run mamba search, but the pyro-ppl feedstock is failing its test because it can't satisfy the requirement. I've been waiting on this PR before I update the feedstock to the latest version.
The CUDA builds are giving us a hard time. For some reason CMake is not finding CUDA correctly and we ran out of ideas. Mysteriously a local build used to pass for me, let me see if the latest commit here still does and I will give it another spin.
Check out the channels in your env, in a base env I only see up to 2.0.0 in conda-forge:
$ mamba search pytorch
...
pytorch 1.13.1 cuda112py39hb0b7ed5_200 conda-forge
pytorch 2.0.0 cpu_py310hd11e9c7_0 conda-forge
pytorch 2.0.0 cpu_py311h410fd25_0 conda-forge
pytorch 2.0.0 cpu_py38h019455c_0 conda-forge
pytorch 2.0.0 cpu_py39he4d1dc0_0 conda-forge
pytorch 2.0.0 cuda112py310he33e0d6_200 conda-forge
pytorch 2.0.0 cuda112py311h13fee9e_200 conda-forge
pytorch 2.0.0 cuda112py38h5e67e12_200 conda-forge
pytorch 2.0.0 cuda112py39ha9981d0_200 conda-forge
Note that the pytorch channel offers 2.0.1.
@conda-forge-admin, please rerender
Check out the channels in your env, in a base env I only see up to 2.0.0 in conda-forge:
Ah yeah, I'm actually seeing it in pkgs/main; I should remove that channel...
Note that the pytorch channel offers 2.0.1.
Yeah, it is definitely available there and I assume most people install it from that channel. But for the purposes of the conda-forge feedstock I like to make sure everything is available before updating. Those that need the latest can always use pip.
By looking here: https://github.com/pytorch/pytorch/blob/0ab74044c2775970d3bc3668454a3152ae18ea82/.ci/docker/common/install_conda.sh#L54 it seems like a particular version of mkl is hardcoded. This Dockerfile is used to build with CUDA: https://github.com/pytorch/pytorch/blob/main/.ci/docker/ubuntu-cuda/Dockerfile If one follows the breadcrumb trail, I believe the MKLDNN variable is set to 0. With these changes from the latest commit, a11a83c, a local build succeeds for me:
modified recipe/build_pytorch.sh
@@ -125,7 +125,7 @@ if [[ ${cuda_compiler_version} != "None" ]]; then
export USE_STATIC_CUDNN=0
export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
export MAGMA_HOME="${PREFIX}"
- export USE_MKLDNN=1
+ export USE_MKLDNN=0
else
if [[ "$target_platform" == *-64 ]]; then
export BLAS="MKL"
modified recipe/meta.yaml
@@ -79,8 +79,8 @@ outputs:
- requests
- future
- six
- - mkl-devel {{ mkl }} # [x86]
- - mkl-include {{ mkl }} # [x86]
+ - mkl==2021.4.0 # [x86]
+ - mkl-include==2021.4.0 # [x86]
- libcblas * *_mkl # [x86]
- libcblas # [not x86]
- liblapack # [not x86]
The test then fails with:
import: 'torch'
Traceback (most recent call last):
File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/test_tmp/run_test.py", line 2, in <module>
import torch
File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/site-packages/torch/__init__.py", line 228, in <module>
_load_global_deps()
File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/site-packages/torch/__init__.py", line 187, in _load_global_deps
raise err
File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/site-packages/torch/__init__.py", line 168, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1690209346820/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla
cehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libmkl_intel_lp64.so.1: cannot open shared object file: No such file or directory
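If I recall correctly, libmkl_intel_lp64.so.1 is the soname shipped by mkl 2021.x, while mkl 2022.x ships libmkl_intel_lp64.so.2, so building against the pinned mkl==2021.4.0 while the test environment resolves a newer mkl would explain the missing library. A quick check (sketch; the prefix is whatever conda-build printed for the test environment):
# Sketch: list the mkl packages and libmkl_intel_lp64 sonames actually present in the
# test environment; "$TEST_ENV_PREFIX" is a placeholder for the _test_env_... path above.
conda list -p "$TEST_ENV_PREFIX" | grep -i mkl
ls "$TEST_ENV_PREFIX/lib" | grep libmkl_intel_lp64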
See #176 for updates
At this point I think going directly for 2.1.0 makes more sense: https://github.com/pytorch/pytorch/releases/tag/v2.1.0
Alas, the problems I had compiling this one are probably still there.
Looking at the build instructions, it seems like they are somewhat simpler than before, though: https://github.com/pytorch/pytorch#from-source
I'm not too sure what changed in 4 months, but it seems like something got fixed: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/199
Working on getting through this.
CPU build passed, so now onto everything else....
Do we still want this given that PR (https://github.com/conda-forge/pytorch-cpu-feedstock/pull/195) added 2.1.0?
Checklist
- Reset the build number to 0 (if the version changed)
- Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)