dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Build xformers fail #282

Closed jasl closed 1 year ago

jasl commented 1 year ago

I'm trying to build my own stable-diffusion-webui image on an AGX Orin (a fresh JetPack 5.1.2 installation).

I ran ./build.sh stable-diffusion-webui and then got:

docker run -t --rm --runtime=nvidia --network=host \
--volume /home/jasl/Workspaces/jetson-containers/packages/llm/transformers:/test \
--volume /home/jasl/Workspaces/jetson-containers/data:/data \
--workdir /test \
stable-diffusion-webui:r35.4.1-transformers \
/bin/bash -c 'python3 huggingface-benchmark.py' \
2>&1 | tee /home/jasl/Workspaces/jetson-containers/logs/20230911_191408/test/stable-diffusion-webui_r35.4.1-transformers_huggingface-benchmark.py.txt; exit ${PIPESTATUS[0]}

Namespace(model='distilgpt2', precision='fp16', prompt='Once upon a time,', runs=2, save='', token='', tokens=[128], warmup=2)
Running on device cuda:0
Input tokens: tensor([[7454, 2402,  257,  640,   11]], device='cuda:0') shape: torch.Size([1, 5])
Loading model distilgpt2 (fp16)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.7
CUDA SETUP: Detected CUDA version 114
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda114.so...
Traceback (most recent call last):
  File "huggingface-benchmark.py", line 71, in <module>
    model = AutoModelForCausalLM.from_pretrained(args.model, **kwargs) #AutoModelForCausalLM.from_pretrained(args.model, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2347, in from_pretrained
    if is_fsdp_enabled():
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 118, in is_fsdp_enabled
    return torch.distributed.is_initialized() and strtobool(os.environ.get("ACCELERATE_USE_FSDP", "False")) == 1
AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jasl/Workspaces/jetson-containers/jetson_containers/build.py", line 95, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push)
  File "/home/jasl/Workspaces/jetson-containers/jetson_containers/container.py", line 135, in build_container
    test_container(container_name, pkg, simulate)
  File "/home/jasl/Workspaces/jetson-containers/jetson_containers/container.py", line 307, in test_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'docker run -t --rm --runtime=nvidia --network=host --volume /home/jasl/Workspaces/jetson-containers/packages/llm/transformers:/test --volume /home/jasl/Workspaces/jetson-containers/data:/data --workdir /test stable-diffusion-webui:r35.4.1-transformers /bin/bash -c 'python3 huggingface-benchmark.py' 2>&1 | tee /home/jasl/Workspaces/jetson-containers/logs/20230911_191408/test/stable-diffusion-webui_r35.4.1-transformers_huggingface-benchmark.py.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.
dusty-nv commented 1 year ago

Hi @jasl , can you try switching to the dev branch of jetson-containers?

Commit https://github.com/dusty-nv/jetson-containers/commit/712252b39835573a8a18bcafecc9d9bb64605c11 is staged there, which patches this issue introduced by recent transformers update.

# "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 118
# AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
RUN PYTHON_ROOT=`pip3 show transformers | grep Location: | cut -d' ' -f2` && \
    sed -i 's|torch.distributed.is_initialized|torch.distributed.is_available|g' -i ${PYTHON_ROOT}/transformers/modeling_utils.py
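
The substitution can be sanity-checked outside the image build; a minimal self-contained sketch (the echoed sample line stands in for the affected line of transformers' modeling_utils.py):

```shell
#!/bin/sh
# Apply the same sed substitution as the Dockerfile patch above to a
# throwaway file standing in for transformers' modeling_utils.py.
f=$(mktemp)
echo 'return torch.distributed.is_initialized() and strtobool(...) == 1' > "$f"
sed -i 's|torch.distributed.is_initialized|torch.distributed.is_available|g' "$f"
cat "$f"   # -> return torch.distributed.is_available() and strtobool(...) == 1
rm -f "$f"
```

(GNU sed assumed, as in the L4T containers; BSD sed would need `-i ''`.)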
jasl commented 1 year ago

Hi @jasl , can you try switching to the dev branch of jetson-containers?

Commit 712252b is staged there, which patches this issue introduced by recent transformers update.

# "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 118
# AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
RUN PYTHON_ROOT=`pip3 show transformers | grep Location: | cut -d' ' -f2` && \
    sed -i 's|torch.distributed.is_initialized|torch.distributed.is_available|g' -i ${PYTHON_ROOT}/transformers/modeling_utils.py

Sorry for the late response.

I tried the dev branch, and the error moved to:

#7 [3/5] RUN cd /opt &&     git clone --branch master --depth=1 https://github.com/AUTOMATIC1111/stable-diffusion-webui &&     cd stable-diffusion-webui &&     git clone https://github.com/dusty-nv/stable-diffusion-webui-tensorrt extensions-builtin/stable-diffusion-webui-tensorrt &&     python3 -c 'from modules import launch_utils; launch_utils.args.skip_python_version_check=True; launch_utils.prepare_environment()'
#7 0.096 Cloning into 'stable-diffusion-webui'...
#7 2.022 Cloning into 'extensions-builtin/stable-diffusion-webui-tensorrt'...
#7 5.522 Traceback (most recent call last):
#7 5.522   File "<string>", line 1, in <module>
#7 5.522   File "/opt/stable-diffusion-webui/modules/launch_utils.py", line 356, in prepare_environment
#7 5.522     raise RuntimeError(
#7 5.522 RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check
#7 5.522 Python 3.8.10 (default, May 26 2023, 14:05:08)
#7 5.522 [GCC 9.4.0]
#7 5.522 Version: v1.6.0
#7 5.522 Commit hash: 5ef669de080814067961f28357256e8fe27544f4
#7 ERROR: process "/bin/sh -c cd /opt &&     git clone --branch ${STABLE_DIFFUSION_WEBUI_VERSION} --depth=1 https://github.com/${STABLE_DIFFUSION_WEBUI_REPO} &&     cd stable-diffusion-webui &&     git clone https://github.com/dusty-nv/stable-diffusion-webui-tensorrt extensions-builtin/stable-diffusion-webui-tensorrt &&     python3 -c 'from modules import launch_utils; launch_utils.args.skip_python_version_check=True; launch_utils.prepare_environment()'" did not complete successfully: exit code: 1
------
 > [3/5] RUN cd /opt &&     git clone --branch master --depth=1 https://github.com/AUTOMATIC1111/stable-diffusion-webui &&     cd stable-diffusion-webui &&     git clone https://github.com/dusty-nv/stable-diffusion-webui-tensorrt extensions-builtin/stable-diffusion-webui-tensorrt &&     python3 -c 'from modules import launch_utils; launch_utils.args.skip_python_version_check=True; launch_utils.prepare_environment()':
2.022 Cloning into 'extensions-builtin/stable-diffusion-webui-tensorrt'...
5.522 Traceback (most recent call last):
5.522   File "<string>", line 1, in <module>
5.522   File "/opt/stable-diffusion-webui/modules/launch_utils.py", line 356, in prepare_environment
5.522     raise RuntimeError(
5.522 RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check
5.522 Python 3.8.10 (default, May 26 2023, 14:05:08)
5.522 [GCC 9.4.0]
5.522 Version: v1.6.0
5.522 Commit hash: 5ef669de080814067961f28357256e8fe27544f4
------
Dockerfile:17
--------------------
  16 |
  17 | >>> RUN cd /opt && \
  18 | >>>     git clone --branch ${STABLE_DIFFUSION_WEBUI_VERSION} --depth=1 https://github.com/${STABLE_DIFFUSION_WEBUI_REPO} && \
  19 | >>>     cd stable-diffusion-webui && \
  20 | >>>     git clone https://github.com/dusty-nv/stable-diffusion-webui-tensorrt extensions-builtin/stable-diffusion-webui-tensorrt && \
  21 | >>>     python3 -c 'from modules import launch_utils; launch_utils.args.skip_python_version_check=True; launch_utils.prepare_environment()'
  22 |
--------------------
ERROR: failed to solve: process "/bin/sh -c cd /opt &&     git clone --branch ${STABLE_DIFFUSION_WEBUI_VERSION} --depth=1 https://github.com/${STABLE_DIFFUSION_WEBUI_REPO} &&     cd stable-diffusion-webui &&     git clone https://github.com/dusty-nv/stable-diffusion-webui-tensorrt extensions-builtin/stable-diffusion-webui-tensorrt &&     python3 -c 'from modules import launch_utils; launch_utils.args.skip_python_version_check=True; launch_utils.prepare_environment()'" did not complete successfully: exit code: 1
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jasl/Workspaces/jetson-containers/jetson_containers/build.py", line 95, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push)
  File "/home/jasl/Workspaces/jetson-containers/jetson_containers/container.py", line 128, in build_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'docker build --network=host --tag stable-diffusion-webui:r35.4.1-stable-diffusion-webui --file /home/jasl/Workspaces/jetson-containers/packages/diffusion/stable-diffusion-webui/Dockerfile --build-arg BASE_IMAGE=stable-diffusion-webui:r35.4.1-opencv /home/jasl/Workspaces/jetson-containers/packages/diffusion/stable-diffusion-webui 2>&1 | tee /home/jasl/Workspaces/jetson-containers/logs/20230912_010054/build/stable-diffusion-webui_r35.4.1-stable-diffusion-webui.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.
dusty-nv commented 1 year ago

RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

Do you have your default docker runtime set to nvidia? https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#docker-default-runtime

If so, my guess is that stable-diffusion-webui was updated and installs a different pytorch wheel that wasn't built with CUDA. I will kick off a workflow to check it...

jasl commented 1 year ago

https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#docker-default-runtime

Ah, sorry! I did a quick check, and it's not using the NVIDIA runtime... I had just reset my AGX Orin with the JetPack SDK Manager.

I corrected the setting, but on rerun I got the same error. I've cleaned the build cache and am retrying.

dusty-nv commented 1 year ago

If you haven't already, after you modify /etc/docker/daemon.json, you need to either reboot or restart the docker service:

sudo systemctl restart docker
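
For reference, the setup docs linked above configure /etc/docker/daemon.json roughly like this, with nvidia as the default runtime (paraphrased from the docs; check the linked page for the current version):

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
```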
jasl commented 1 year ago

https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#docker-default-runtime

I did; after restarting, I still got the same error.

dusty-nv commented 1 year ago

Hmm... it just re-built successfully here: https://github.com/dusty-nv/jetson-containers/actions/runs/6149838054/job/16686510903

You aren't using buildkit, right?

jasl commented 1 year ago

Hmm... it just re-built successfully here: https://github.com/dusty-nv/jetson-containers/actions/runs/6149838054/job/16686510903

You aren't using buildkit right?

Ah yes, I added the buildx and compose plugins.

jasl commented 1 year ago

It seems buildx caused the trouble. I removed it and am retrying; sorry for wasting your time.

jasl commented 1 year ago

May I ask an off-topic question? I'm actually trying to run https://github.com/vladmandic/automatic on the AGX Orin. It's an A1111 web UI fork; in my experience it's more stable and has better diffusers and SDXL support.

But it has dropped Python 3.8 support, and NVIDIA only provides a cp38 prebuilt wheel. Is it possible to compile PyTorch for a higher Python version?

dusty-nv commented 1 year ago

@jasl yes, it should be possible to recompile PyTorch; PyTorch >= 2.0 doesn't require patches to build for ARM64.

However, in my experience you can often just patch the setup.py/etc. of a project, and it will still work on Python 3.8.

Actually, I already have to do this in my Dockerfile for stable-diffusion-webui:

https://github.com/dusty-nv/jetson-containers/blob/47aa733d9a1c21b08e9333a627718f98d733539c/packages/diffusion/stable-diffusion-webui/Dockerfile#L26
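
As a sketch of that kind of patch (the file contents and version strings below are made up for illustration, not taken from the linked Dockerfile or any particular project): relax the declared python_requires so pip will accept the package under Python 3.8:

```shell
#!/bin/sh
# Illustrative only: loosen a hypothetical project's python_requires
# so it installs under Python 3.8 (values are invented for this sketch).
f=$(mktemp)
echo 'python_requires=">=3.9",' > "$f"
sed -i 's|>=3.9|>=3.8|' "$f"
cat "$f"   # -> python_requires=">=3.8",
rm -f "$f"
```

The package may still import fine under 3.8 after this, but only if it doesn't actually use newer-Python syntax or stdlib features.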

jasl commented 1 year ago

@jasl yes, it should be possible to recompile PyTorch; PyTorch >= 2.0 doesn't require patches to build for ARM64.

However, in my experience you can often just patch the setup.py/etc. of a project, and it will still work on Python 3.8.

Actually, I already have to do this in my Dockerfile for stable-diffusion-webui:

https://github.com/dusty-nv/jetson-containers/blob/47aa733d9a1c21b08e9333a627718f98d733539c/packages/diffusion/stable-diffusion-webui/Dockerfile#L26

I see, thank you!

jasl commented 1 year ago

So the conclusion is: the dev branch works, but it runs into trouble when Docker builds with buildx. I heard JetPack 6 is coming soon, and I'm not sure whether buildx will become the default in the Docker packaged for Ubuntu 22.04.

dusty-nv commented 1 year ago

buildx / buildkit doesn't honor the default docker runtime from /etc/docker/daemon.json, so container builds that need the CUDA runtime available at build time won't work with it.
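
An alternative to uninstalling the buildx plugin, assuming the build goes through classic `docker build` (Docker's documented DOCKER_BUILDKIT switch; `docker buildx build` invocations ignore it):

```shell
#!/bin/sh
# Force the legacy builder for this shell session; the classic builder
# honors "default-runtime": "nvidia" from /etc/docker/daemon.json,
# while BuildKit does not.
export DOCKER_BUILDKIT=0
echo "DOCKER_BUILDKIT=$DOCKER_BUILDKIT"   # -> DOCKER_BUILDKIT=0
```

Then re-run the build script in the same session.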