dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Building container failing on Numpy versions #561

Open JoostdeK opened 1 week ago

JoostdeK commented 1 week ago

I couldn't for the life of me get this command to work:

jetson-containers build whisper_trt nano_llm --name xyz

I saw a lot of NumPy errors during the build, and it would later fail on the onnxruntime step.

These were some of the errors:

Using pip 24.0 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Looking in indexes: http://jetson.webredirect.org/jp6/cu122, https://pypi.ngc.nvidia.com
Processing /opt/onnxruntime_gpu-1.17.0-cp310-cp310-linux_aarch64.whl
Requirement already satisfied: coloredlogs in /usr/local/lib/python3.10/dist-packages (from onnxruntime-gpu==1.17.0) (15.0.1)
Requirement already satisfied: flatbuffers in /usr/local/lib/python3.10/dist-packages (from onnxruntime-gpu==1.17.0) (24.3.25)
Requirement already satisfied: numpy>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from onnxruntime-gpu==1.17.0) (2.0.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from onnxruntime-gpu==1.17.0) (24.1)
Requirement already satisfied: protobuf in /usr/local/lib/python3.10/dist-packages (from onnxruntime-gpu==1.17.0) (5.27.1)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime-gpu==1.17.0) (1.12.1)
Requirement already satisfied: humanfriendly>=9.1 in /usr/local/lib/python3.10/dist-packages (from coloredlogs->onnxruntime-gpu==1.17.0) (10.0)
Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->onnxruntime-gpu==1.17.0) (1.3.0)
onnxruntime-gpu is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
+ python3 -c 'import onnxruntime; print(onnxruntime.__version__);'

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import ExecutionMode  # noqa: F401
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/_pybind_state.py", line 32, in <module>
    from .onnxruntime_pybind11_state import *  # noqa
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/numpy/core/_multiarray_umath.py", line 44, in __getattr__
    raise ImportError(msg)
ImportError: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

I searched for this and found this issue: https://github.com/pytorch/pytorch/issues/128860#issuecomment-2175641041

Here it was mentioned that setting numpy to 1.26.4 fixed that issue. I adjusted the Numpy dockerfile in jetson-containers to:

RUN pip3 install --upgrade --no-cache-dir --verbose numpy==v1.26.4 && \
    pip3 show numpy && python3 -c 'import numpy; print(numpy.__version__)' 

That solved the build issue for me. No idea yet how it runs tho lol.
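A quick way to confirm that the pin actually held at the end of a build is to check the installed NumPy version from Python. The following is only a minimal sketch, not part of jetson-containers: the function names `is_pre_numpy2`, `installed_numpy_version`, and `describe_numpy` are made up for illustration, and it only inspects the installed package metadata.

```python
from importlib import metadata


def is_pre_numpy2(version: str) -> bool:
    """Return True if a version string such as '1.26.4' has a major version below 2."""
    major = version.lstrip("v").split(".")[0]
    return int(major) < 2


def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is not installed."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


def describe_numpy(version):
    """Turn a version string (or None) into a short human-readable status message."""
    if version is None:
        return "numpy is not installed"
    if is_pre_numpy2(version):
        return f"numpy {version}: fine for modules compiled against NumPy 1.x"
    return f"numpy {version}: modules compiled against NumPy 1.x may fail to import"
```

Running `print(describe_numpy(installed_numpy_version()))` as the last step of a container build would show whether a later package quietly upgraded NumPy again.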

dusty-nv commented 1 week ago

@JoostdeK yes, numpy 2.0 was released Sunday and immediately started breaking things in the builds lol...

I previously patched a few dockerfiles including NanoLLM to lock it to numpy<2, like in commit https://github.com/dusty-nv/jetson-containers/commit/b19854a9f02ad4e2658679475b5c3c9fec41a375

Can you recall if you were running updated jetson-containers before trying that build? I am curious whether I should just carte blanche pin it in the underlying numpy dockerfile, or only where needed. I was also having an issue where some packages like scipy were upgrading it to numpy2 later in the builds.

I will try it like you did in the numpy dockerfile itself, and see if those builds pass. I think until a lot of the major ML packages catch up there will be a lot of these errors occurring.

JoostdeK commented 1 week ago

Sadly no, I have never built one before, I only got the Orin Nano last week haha. But I figured it should work, so that's why I spent some time on it. I think the commit you mentioned for NanoLLM also led me to try this. So sadly, since I'm not familiar with more of the code yet, I have no more advice haha.

Edit: I did remove all Docker images and ran a system prune so that nothing was cached. That was before I found the option for build flags =)

eufrizz commented 1 week ago

If anyone else is having this issue when building OpenCV, I successfully patched it by adding this line to packages/opencv/install.sh:

diff --git a/packages/opencv/install.sh b/packages/opencv/install.sh
index 480ba8c1..86f68ea7 100755
--- a/packages/opencv/install.sh
+++ b/packages/opencv/install.sh
@@ -19,5 +19,6 @@ else
     $ROOT/install_pip.sh
 fi

+python3 -m pip install --force-reinstall 'scipy<1.13' 'numpy<2'
 python3 -c "import cv2; print('OpenCV version:', str(cv2.__version__)); print(cv2.getBuildInformation())"

EDIT: this was actually unnecessary, simply specifying the version in packages/numpy/Dockerfile as the OP did, did the trick, I just didn't know my way around this repo before.

dusty-nv commented 1 week ago

Thanks @eufrizz - sadly my earlier experiment of installing numpy<2 in the numpy dockerfile did not work, because packages in later containers (like scipy) install numpy2. Until all the downstream dependencies catch up, I'm not sure how to fix this in an automated way without the manual patches in the other dockerfiles. Can't imagine we are the only ones feeling the pain 🤣
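To see which installed packages could pull NumPy back up, one option is to list every distribution that declares a numpy requirement. This is a rough sketch using only the standard library's `importlib.metadata`; the helper names `numpy_requirements` and `packages_depending_on_numpy` are hypothetical, and it only reports what each package declares, not what pip will actually resolve.

```python
import re
from importlib import metadata

# Matches the distribution name at the start of a requirement string,
# e.g. "numpy" in "numpy>=1.21.2" or "numpy (>=1.21.2)".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def numpy_requirements(requires):
    """Return the requirement strings in `requires` that name numpy itself."""
    hits = []
    for req in requires or []:
        match = _NAME.match(req)
        if match and match.group(1).lower() == "numpy":
            hits.append(req)
    return hits


def packages_depending_on_numpy():
    """Map each installed distribution name to its declared numpy requirements."""
    result = {}
    for dist in metadata.distributions():
        hits = numpy_requirements(dist.requires)
        if hits:
            result[dist.metadata.get("Name", "unknown")] = hits
    return result
```

Printing the result inside a failing container should show entries such as the `numpy>=1.21.2` requirement from opencv-contrib-python seen in the logs, which helps narrow down which Dockerfiles need a manual pin.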

JoostdeK commented 1 week ago

@dusty-nv Which build command failed? I'll try too.

johnnynunez commented 1 week ago

Post everything that fails here. It will surely be external libraries like onnxruntime, scipy, etc. PyTorch, OpenCV, and TensorFlow are already compatible with numpy 2.0.

Dronakurl commented 1 week ago

Having a similar problem. When I try to build this container,

jetson-containers build --name=torchbase pytorch opencv python:3.12 ffmpeg numpy torchvision

I get an error:

Requirement already satisfied: numpy>=1.21.2 in /usr/local/lib/python3.10/dist-packages (from opencv-contrib-python==4.8.1.84) (2.0.0)
Installing collected packages: opencv-contrib-python
  Attempting uninstall: opencv-contrib-python
    Found existing installation: opencv-contrib-python 4.8.1.80
    Uninstalling opencv-contrib-python-4.8.1.80:
      Removing file or directory /usr/local/lib/python3.10/dist-packages/cv2/
      Removing file or directory /usr/local/lib/python3.10/dist-packages/opencv_contrib_python-4.8.1.80.dist-info/
      Successfully uninstalled opencv-contrib-python-4.8.1.80
Successfully installed opencv-contrib-python-4.8.1.84
+ python3 -c 'import cv2; print('\''OpenCV version:'\'', str(cv2.__version__)); print(cv2.getBuildInformation())'

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
AttributeError: _ARRAY_API not found
Traceback (most recent call last):

Is there a chance that this is the same issue?

johnnynunez commented 1 week ago

For OpenCV to work with numpy 2.0, it needs to be version 4.10.0.84.

JoostdeK commented 1 week ago

@Dronakurl That is the same issue. I just tried it, and it builds on my PC with the solution that worked for me. But Dusty mentioned it doesn't work for him. As I mentioned in my post, I did clear Docker of any and all lingering images and caches. Could it be pulling something from cache with numpy > 2?

dusty-nv commented 1 week ago

> As I mentioned in my post, I did clear all of docker of any and all lingering images and caches. Could be pulling something from cache with numpy > 2?

@JoostdeK it could be that the container stack you were building did not have other packages installed during the build which upgraded numpy later. For example, right now anytime scipy gets installed, it wants to auto-upgrade numpy to numpy2. And in the nano_llm build I tried, scipy gets installed at some point (which is why I had to pip3 install --force-reinstall 'scipy<1.13' 'numpy<2' instead of just numpy<2, because scipy<1.13 is before it started depending on numpy2)

I'm not going to exhaustively go through each Dockerfile and temporarily try/patch all instances where numpy needs to be pinned - as @johnnynunez pointed out, fortunately some packages have already started catching up. Let's continue posting the ones with issues to this thread and selectively patch them as needed. For now, I have committed the patch to the main numpy dockerfile (this is in the jetson-containers dev branch in https://github.com/dusty-nv/jetson-containers/commit/4c8c306739af22627525df1b6e714d57567b9c45)

As mentioned, downstream pip installs can still override this (which in some cases seems desirable if the likes of pytorch, opencv, scipy, etc. need numpy2). The challenges come in when packages in the same container are incompatible because one of them needs numpy2 while others need numpy<2.
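One possible way to keep later installs from overriding a pin is pip's constraints file (`pip install --constraint <file>`, also settable through the `PIP_CONSTRAINT` environment variable). The sketch below is an assumption about how that could be wired up, not something jetson-containers does in this thread; `write_constraints` and `pip_install_command` are hypothetical helpers that only write the file and build the command line.

```python
import os
import sys
import tempfile


def write_constraints(lines, directory=None):
    """Write constraint lines (e.g. 'numpy<2') to a new file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt", prefix="constraints-", dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return path


def pip_install_command(packages, constraints_path):
    """Build a pip command that installs `packages` under the given constraints file."""
    return [sys.executable, "-m", "pip", "install",
            "--constraint", constraints_path, *packages]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. With a `numpy<2` constraint in place, a package that strictly requires numpy 2 should make pip report a resolution error instead of silently upgrading, which at least surfaces the incompatible combinations described above.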