deephaven / deephaven-core

Deephaven Community Core

SIGSEGV when instantiating PyTorch model #5933

Open alexpeters1208 opened 3 months ago

alexpeters1208 commented 3 months ago

I can consistently segfault the server, in both Docker and pip installs, by running the following code. It requires the Python packages torch and transformers.

from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

and then running the last line again:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Error:

docker-deephaven-1  | #
docker-deephaven-1  | # A fatal error has been detected by the Java Runtime Environment:
docker-deephaven-1  | #
docker-deephaven-1  | #  SIGSEGV (0xb) at pc=0x0000ffff11b7255c, pid=1, tid=189
docker-deephaven-1  | #
docker-deephaven-1  | # JRE version: OpenJDK Runtime Environment Temurin-21.0.4+7 (21.0.4+7) (build 21.0.4+7-LTS)
docker-deephaven-1  | # Java VM: OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (21.0.4+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
docker-deephaven-1  | # Problematic frame:
docker-deephaven-1  | # C  [libpython3.10.so+0x1b255c]
docker-deephaven-1  | #
docker-deephaven-1  | # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
docker-deephaven-1  | #
docker-deephaven-1  | # An error report file with more information is saved as:
docker-deephaven-1  | # //hs_err_pid1.log
docker-deephaven-1  | [45.206s][warning][os] Loading hsdis library failed
docker-deephaven-1  | #
docker-deephaven-1  | # If you would like to submit a bug report, please visit:
docker-deephaven-1  | #   https://github.com/adoptium/adoptium-support/issues
docker-deephaven-1  | # The crash happened outside the Java Virtual Machine in native code.
docker-deephaven-1  | # See problematic frame for where to report the bug.
docker-deephaven-1  | #
docker-deephaven-1  | 
docker-deephaven-1  | [error occurred during error reporting (), id 0x5, SIGTRAP (0x5) at pc=0x0000ffff950771ec]
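
The crash header names the faulting thread (tid=189). A minimal sketch for checking which thread each console command runs on, assuming the JVM's tid is the Linux kernel thread id, which is what `threading.get_native_id()` reports. The helper name `describe_current_thread` is hypothetical:

```python
import os
import threading


def describe_current_thread():
    """Return identifying details for the thread executing this code."""
    thread = threading.current_thread()
    return {
        "pid": os.getpid(),
        "thread_name": thread.name,
        # Kernel-level thread id. Assumption: this is the same kind of id
        # the JVM prints as `tid` in its fatal error header.
        "native_id": threading.get_native_id(),
    }


# Run once before each from_pretrained call and compare the results.
print(describe_current_thread())
```

If the two console commands report different native ids, that would suggest the second call runs on a different thread than the first, which a single script would not do.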

This does not happen in a plain Python console. It also does not happen if I run the last line twice in the same script:

import time

from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

time.sleep(5)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

I'm on the latest Docker release, which is 0.35.2, and I observe the same behavior on edge. Also, maybe importantly, I'm on Apple Silicon. @jjbrosnan tried to repro on x86. He could repro the crash on latest, but not on edge: on edge, his DH server hung at the same point where mine crashed.
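
For the hang variant, one option is the standard library's `faulthandler.dump_traceback_later`, which writes every Python thread's stack to a file if a call takes longer than a timeout. This is only a sketch: the helper name `dump_if_hung` and the file paths are hypothetical, and it has not been tried inside a Deephaven session.

```python
import faulthandler
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def dump_if_hung(timeout_s, log_path):
    """Write all Python thread stacks to log_path if the body exceeds timeout_s."""
    # faulthandler needs a real file descriptor, so open a file explicitly
    # rather than relying on sys.stderr, which an embedding host may replace.
    with open(log_path, "w") as log:
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            yield
        finally:
            # Disarm the watchdog if the body finished (or raised) in time.
            faulthandler.cancel_dump_traceback_later()


# Demo: time.sleep stands in for the call that hangs.
demo_path = os.path.join(tempfile.gettempdir(), "hang_dump_demo.txt")
with dump_if_hung(0.2, demo_path):
    time.sleep(0.6)
with open(demo_path) as f:
    print(f.read())
```

In the console, the second `from_pretrained` call would go inside the `with` block with a longer timeout (say 60 seconds). The dump is written by a watchdog thread, so it should appear even if the call never returns.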

alexpeters1208 commented 3 months ago

Evaluating pip list for the minimum Docker env where this issue arises:

docker-deephaven-1  | 2024-08-13T20:41:05.037Z | r-Scheduler-Serial-1 |  INFO | .p.PythonDeephavenSession | Evaluating command: os.system("pip list")
docker-deephaven-1  | Package            Version
docker-deephaven-1  | ------------------ -----------
docker-deephaven-1  | certifi            2024.7.4
docker-deephaven-1  | charset-normalizer 3.3.2
docker-deephaven-1  | deephaven-core     0.35.2
docker-deephaven-1  | deephaven-plugin   0.6.0
docker-deephaven-1  | filelock           3.15.4
docker-deephaven-1  | fsspec             2024.6.1
docker-deephaven-1  | huggingface-hub    0.24.5
docker-deephaven-1  | idna               3.7
docker-deephaven-1  | jedi               0.18.2
docker-deephaven-1  | Jinja2             3.1.4
docker-deephaven-1  | jpy                0.17.0
docker-deephaven-1  | llvmlite           0.43.0
docker-deephaven-1  | MarkupSafe         2.1.5
docker-deephaven-1  | mpmath             1.3.0
docker-deephaven-1  | networkx           3.3
docker-deephaven-1  | numba              0.60.0
docker-deephaven-1  | numpy              2.0.1
docker-deephaven-1  | packaging          24.1
docker-deephaven-1  | pandas             2.2.2
docker-deephaven-1  | parso              0.8.4
docker-deephaven-1  | pip                24.1.2
docker-deephaven-1  | pyarrow            17.0.0
docker-deephaven-1  | python-dateutil    2.9.0.post0
docker-deephaven-1  | pytz               2024.1
docker-deephaven-1  | PyYAML             6.0.2
docker-deephaven-1  | regex              2024.7.24
docker-deephaven-1  | requests           2.32.3
docker-deephaven-1  | safetensors        0.4.4
docker-deephaven-1  | setuptools         71.1.0
docker-deephaven-1  | six                1.16.0
docker-deephaven-1  | sympy              1.13.2
docker-deephaven-1  | tokenizers         0.19.1
docker-deephaven-1  | torch              2.4.0
docker-deephaven-1  | tqdm               4.66.5
docker-deephaven-1  | transformers       4.44.0
docker-deephaven-1  | typing_extensions  4.12.2
docker-deephaven-1  | tzdata             2024.1
docker-deephaven-1  | urllib3            2.2.2
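
Since CPU architecture and Python version may matter here, a shorter snapshot than the full pip list can be easier to compare across machines. A sketch using only the standard library; the function name `environment_snapshot` and the default package selection are illustrative choices:

```python
import platform
from importlib import metadata


def environment_snapshot(packages=("deephaven-core", "jpy", "numpy", "torch", "transformers")):
    """Collect interpreter, platform, and selected package versions."""
    snapshot = {
        "python": platform.python_version(),
        "machine": platform.machine(),  # CPU architecture as reported by the OS
        "system": platform.system(),
    }
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot


for key, value in environment_snapshot().items():
    print(f"{key:16} {value}")
```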

alexpeters1208 commented 3 months ago

Creating the minimum Docker env for reproducing:

(base) alexpeters@Alexs-MBP-2 Docker % cat requirements.txt
torch
transformers
(base) alexpeters@Alexs-MBP-2 Docker % cat Dockerfile
FROM ghcr.io/deephaven/server:latest

# copy python requirements and data
COPY requirements.txt /requirements.txt

# install python requirements
RUN pip install -r /requirements.txt && rm /requirements.txt
(base) alexpeters@Alexs-MBP-2 Docker % cat docker-compose.yml
services:
  deephaven:
    build: .
    ports:
      - '${DEEPHAVEN_PORT:-10000}:10000'
    volumes:
      - ./data:/data
    environment:
      - START_OPTS=-Xmx8g -DAuthHandlers=io.deephaven.auth.AnonymousAuthenticationHandler

alexpeters1208 commented 3 months ago

Currently, I am unable to repro the crash with a clean pip environment. In pip-installed DH, I'm getting the scenario I described that JJ had on edge: the server does not crash, but hangs indefinitely on the second run of the last line. Here's the relevant pip list:

(ai-crash-venv) (base) alexpeters@Alexs-MBP-2 Pip % pip list
Package            Version
------------------ -----------
certifi            2024.7.4
charset-normalizer 3.3.2
click              8.1.7
deephaven-core     0.35.3
deephaven-plugin   0.6.0
deephaven-server   0.35.3
filelock           3.15.4
fsspec             2024.6.1
huggingface-hub    0.24.5
idna               3.7
java-utilities     0.3.0
jedi               0.18.2
Jinja2             3.1.4
jpy                0.17.0
llvmlite           0.43.0
MarkupSafe         2.1.5
mpmath             1.3.0
networkx           3.3
numba              0.60.0
numpy              2.0.1
packaging          24.1
pandas             2.2.2
parso              0.8.4
pip                24.0
pyarrow            17.0.0
python-dateutil    2.9.0.post0
pytz               2024.1
PyYAML             6.0.2
regex              2024.7.24
requests           2.32.3
safetensors        0.4.4
setuptools         72.2.0
six                1.16.0
sympy              1.13.2
tokenizers         0.19.1
torch              2.4.0
tqdm               4.66.5
transformers       4.44.0
typing_extensions  4.12.2
tzdata             2024.1
urllib3            2.2.2
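
Comparing the two lists by hand, the Docker and pip environments differ only in click, deephaven-core, deephaven-server, java-utilities, pip, and setuptools. A small sketch for doing that comparison mechanically; the helper names are hypothetical, and the sample inputs are excerpts of the two lists above rather than the full output:

```python
def parse_pip_list(text):
    """Parse `pip list` output into {package: version}, ignoring header rows."""
    versions = {}
    for line in text.splitlines():
        # Drop a log prefix such as "docker-deephaven-1  | " if present.
        parts = line.rsplit("|", 1)[-1].split()
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        versions[parts[0]] = parts[1]
    return versions


def diff_environments(left, right):
    """Return {package: (left_version, right_version)} for every mismatch."""
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(set(left) | set(right), key=str.lower)
        if left.get(name) != right.get(name)
    }


docker_excerpt = """\
docker-deephaven-1  | Package            Version
docker-deephaven-1  | ------------------ -----------
docker-deephaven-1  | deephaven-core     0.35.2
docker-deephaven-1  | pip                24.1.2
docker-deephaven-1  | setuptools         71.1.0
docker-deephaven-1  | torch              2.4.0
"""

pip_excerpt = """\
Package            Version
------------------ -----------
click              8.1.7
deephaven-core     0.35.3
deephaven-server   0.35.3
java-utilities     0.3.0
pip                24.0
setuptools         72.2.0
torch              2.4.0
"""

changes = diff_environments(parse_pip_list(docker_excerpt), parse_pip_list(pip_excerpt))
for name, (docker_version, pip_version) in changes.items():
    print(f"{name:18} docker={docker_version} pip={pip_version}")
```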

jmao-denver commented 3 months ago

[image]

What @niloc132 has discovered might have some bearing (really just a wild guess) on this issue. It might be worth trying the code in Python 3.11.