InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug Regression] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 #2164

Closed: pseudotensor closed this issue 3 weeks ago

pseudotensor commented 1 month ago

Describe the bug

Text handling by the model is fine, but any request that includes an image crashes the server.

Reproduction

Just make any single-image request as one would for other vision models. This is sufficient to cause a crash every time:

import base64
from io import BytesIO

import requests
from openai import OpenAI
from PIL import Image

client = OpenAI(base_url='http://<fill_IP>/v1')

# The encoding function I linked previously - but we actually don't use this
# function in the API server.
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """Encode an image to base64 format."""
    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# load the image from the URL (only needed for the PIL re-encoding path above)
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image fetched from a URL: base64 the raw bytes directly
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": 'data:image/jpeg;base64,' + base64_correct,
                },
            },
        ],
    }
]

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-Llama3-76B",
    messages=messages,
    temperature=0.0,
    max_tokens=300,
)

print(response.choices[0])
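
For contrast, a minimal text-only request against the same endpoint completes normally, which narrows the crash to the image path. A sketch, reusing the base_url placeholder and model name from the repro above:

from openai import OpenAI

client = OpenAI(base_url='http://<fill_IP>/v1')

# Text-only request: per the report, this path works; only requests
# that include an image trigger the segfault.
response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-Llama3-76B",
    messages=[{"role": "user", "content": "Describe Big Ben in one sentence."}],
    temperature=0.0,
    max_tokens=100,
)
print(response.choices[0])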

I haven't used the latest lmdeploy with other models like InternVL 1-5 that work fine with the older version, so it's possible those are broken too. I'll try InternVL 1-5 to see whether lmdeploy is generally broken.

Environment

Using the latest docker image, with an extra build step for the vision dependencies, on 4x H100:

docker stop internvl2_llama3_76b_lmdeploy ; docker rm internvl2_llama3_76b_lmdeploy
docker run -d --restart=always --runtime nvidia --gpus '"device=0,1,2,3"' \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
    -p 23343:23333 \
    --ipc=host \
    --name internvl2_llama3_76b_lmdeploy \
    internvlmain2 \
    lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B \
    --tp 4 \
    --model-name OpenGVLab/InternVL2-Llama3-76B

See how the docker image is built here: https://github.com/InternLM/lmdeploy/issues/2163
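
Before sending requests, a quick way to confirm the server inside the container is actually up. A sketch: port 23343 comes from the -p mapping above, /health is the route visible in the server log below, and /v1/models assumes the standard OpenAI-compatible model listing is enabled:

import requests

base = 'http://localhost:23343'

# /health appears in the server log below; expect HTTP 200 once the model is loaded
print(requests.get(base + '/health', timeout=5).status_code)

# OpenAI-compatible model listing, assuming the route is enabled
print(requests.get(base + '/v1/models', timeout=5).json())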

Error traceback

INFO:     172.16.0.83:34664 - "POST /v1/completions HTTP/1.1" 200 OK
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /opt/lmdeploy/src/turbomind/utils/cublasMMWrapper.cc:307 

terminate called recursively
terminate called recursively
terminate called recursively
[dfe606afa87e:1    :0:414] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
INFO:     172.16.0.83:14234 - "GET /health HTTP/1.1" 200 OK
pseudotensor commented 1 month ago

If I try the pytorch backend, I get this on startup:

language_model.model.layers.79.input_layernorm.weight:  54%|...
2024-07-26 20:45:57,609 - lmdeploy - ERROR - RuntimeError: Internal Triton PTX codegen error: 
ptxas fatal   : Value 'sm_90a' is not defined for option 'gpu-name'

2024-07-26 20:45:57,609 - lmdeploy - ERROR - <Triton> test failed!
Please ensure it has been installed correctly.

Same for the InternVL 1-5 model.

My nvidia-smi output, in case it's helpful:

Fri Jul 26 20:51:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0             119W / 700W |  67915MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:2D:00.0 Off |                    0 |
| N/A   36C    P0             120W / 700W |  68135MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:44:00.0 Off |                    0 |
| N/A   33C    P0             124W / 700W |  68135MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:5B:00.0 Off |                    0 |
| N/A   37C    P0             124W / 700W |  67557MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:89:00.0 Off |                    0 |
| N/A   31C    P0             114W / 700W |  71661MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:A8:00.0 Off |                    0 |
| N/A   36C    P0             118W / 700W |  78617MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   37C    P0             123W / 700W |  80371MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:D8:00.0 Off |                   On |
| N/A   32C    P0             120W / 700W |  63356MiB / 81559MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  7    1   0   0  |           32519MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    2   0   1  |           30837MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2083468      C   /opt/py38/bin/python3                     67908MiB |
|    1   N/A  N/A   2083468      C   /opt/py38/bin/python3                     68128MiB |
|    2   N/A  N/A   2083468      C   /opt/py38/bin/python3                     68128MiB |
|    3   N/A  N/A   2083468      C   /opt/py38/bin/python3                     67550MiB |
|    4   N/A  N/A   1172218      C   python3                                   71642MiB |
|    5   N/A  N/A   1171696      C   python3                                   78600MiB |
|    6   N/A  N/A   1181117      C   /opt/py38/bin/python3                     80364MiB |
|    7    1    0    1465874      C   python                                    32484MiB |
|    7    2    0    1475745      C   python3                                   30802MiB |
+---------------------------------------------------------------------------------------+

This is after the model is already loaded on GPUs 0-3.

But some suggest CUDA 11.8 is fine for sm_90: https://discuss.pytorch.org/t/cuda-version-conundrum/185714/3
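
A quick way to check what the toolchain in the container actually supports (a sketch; assumes torch is installed and ptxas is on PATH). Note that plain sm_90 has been supported since CUDA 11.8, but the Hopper-specific sm_90a target that Triton requests here was only added in CUDA 12.0, so an 11.8 ptxas rejecting it would match the error above:

import subprocess

import torch

# CUDA runtime torch was built against, and the device compute capability
# ((9, 0) on H100)
print('torch CUDA version:', torch.version.cuda)
print('device capability:', torch.cuda.get_device_capability(0))

# ptxas from the CUDA toolkit in the image; it needs to be 12.x to know sm_90a.
# (Triton may ship its own copy, but this shows what the image provides.)
print(subprocess.run(['ptxas', '--version'],
                     capture_output=True, text=True).stdout)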

pseudotensor commented 1 month ago

Ah, even OpenGVLab/InternVL-Chat-V1-5 segfaults.

So lmdeploy is broken somehow: I'm using the same docker build scripts as for deployments that are already running fine; the only difference is using the latest lmdeploy repo hash.

pseudotensor commented 1 month ago

I'm also confused by this in the README:

Since v0.3.0, the default prebuilt package is compiled on CUDA 12. However, if CUDA 11+ is required, you can install lmdeploy by...

But docker/Dockerfile still references cu118 and (I guess) uses a tritonserver base image that only has CUDA 11.8.

Is this a problem for deploying on H100? It worked with lmdeploy from (maybe) 2-3 weeks ago, so I guess not, but maybe the pytorch backend issue is related.

pseudotensor commented 1 month ago

Why can't the docker image use an updated triton server image,

https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html#rel-24-06

which uses CUDA 12.5?

And why is the triton server image used as the base image at all? It seems overly complicated, and you don't even use the triton server. Why not just plain Ubuntu with Python 3.10?

pseudotensor commented 1 month ago

The exact same build process on commit f6138148aaf30ce019e38c9f4295d00e8ca4d66d works fine with no segfault, so this is definitely a regression.

RunningLeon commented 1 month ago

@pseudotensor hi, thanks for your feedback. It looks like this only happens with the triton-server-based docker image. https://github.com/InternLM/lmdeploy/pull/1971 should fix it. As for the docker image, we will consider providing a CUDA 12.x image later.

RunningLeon commented 1 month ago

@pseudotensor hi, could you kindly try the updated dockerfile from https://github.com/InternLM/lmdeploy/pull/2182? Any feedback would be greatly appreciated.

lvhan028 commented 1 month ago

Why can't the docker image use an updated triton server image,

https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html#rel-24-06

which uses CUDA 12.5?

And why is the triton server image used as the base image at all? It seems overly complicated, and you don't even use the triton server. Why not just plain Ubuntu with Python 3.10?

The initial version of lmdeploy inherited FasterTransformer and the triton inference server. As lmdeploy has developed, we have been gradually removing them (#1986). The cu12 docker image (#2182) won't be released until the full test suite passes.

pseudotensor commented 1 month ago

Still hitting segfaults, though I'm unsure it's the same issue: https://github.com/InternLM/lmdeploy/issues/2223

Probably the same, so not fixed.

lvhan028 commented 1 month ago

Hi @pseudotensor, we tried 8x A100 but were not able to reproduce this issue. Is there any way we can access your environment to debug it?

pseudotensor commented 1 month ago

Hi, I plan to do the debugging suggested here: https://github.com/InternLM/lmdeploy/issues/2223#issuecomment-2266335124

I'm just busy with other things.

I'm unable to give access to the machine directly, but we can do a shared debugging session if that's helpful. You can email me at pseudotensor@gmail.com to set up the details.

pseudotensor commented 3 weeks ago

https://github.com/InternLM/lmdeploy/issues/2223#issuecomment-2290547405