abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

CUDA error 716 running the cuda_simple image #919

Closed lmorandini closed 11 months ago

lmorandini commented 11 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

The Docker container built from the cuda_simple Dockerfile does not crash when a request is sent.

Current Behavior

Whenever a request is sent, the program crashes with this message:

CUDA error 716 at /tmp/pip-install-wor20xk7/llama-cpp-python_3077f152adad4f479ee5f8ba791fa89a/vendor/llama.cpp/ggml-cuda.cu:7104: misaligned address
current device: 0

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-80C       On  | 00000000:00:07.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |  47880MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    244536      C   python3                         47879MiB |
+-----------------------------------------------------------------------------+


* Operating System, e.g. for Linux:
For the VM:

Linux llamab 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


* SDK version, e.g. for Linux:
The image, built using the Dockerfile under cuda_simple, has the following characteristics:

Linux 36009f992dd3 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-80C       On  | 00000000:00:07.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |  47880MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+


The Docker container was started from a Docker Compose file:

model:
  image: cuda_simple:${MODEL_VERSION}
  deploy:
    replicas: ${N_MODEL_REPLICAS}
    resources:
      reservations:
        devices:

Failure Information (for bugs)

It looks like a bug. It used to work until a few days ago, albeit on a different VM, so it may be related to some subtle environment change.

Steps to Reproduce

The image was built from the current state of the main branch (commit `96a377648c97113f443cafd41b6b9ae7f0e4e5ef`) using the provided Dockerfile.
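
Once the container is up, any completion request triggers the crash. Below is a minimal reproduction sketch, assuming the server listens on the llama_cpp.server default port 8000 and exposes the OpenAI-compatible /v1/completions endpoint; the host, prompt, and token count are placeholders, not taken from the original report:

import json
import urllib.request

# Placeholder request: any completion call against the affected build
# reproduces the misaligned-address crash.
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps({"prompt": "Hello", "max_tokens": 16}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])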

Failure Logs

CUDA error 716 at /tmp/pip-install-wor20xk7/llama-cpp-python_3077f152adad4f479ee5f8ba791fa89a/vendor/llama.cpp/ggml-cuda.cu:7104: misaligned address
current device: 0

lmorandini commented 11 months ago

It turned out that there was a bug in llama.cpp whose fix landed in llama-cpp-python yesterday.

Since the fix has not been released yet, the Dockerfile has to be changed to do a development build:

# Install llama-cpp-python from the latest sources (build with CUDA)
RUN git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
WORKDIR llama-cpp-python
# Editable install straight from the checkout so the unreleased fix is included
RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -e ".[server]"
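
To check that the rebuilt image actually picks up the fix, a direct completion through the Python bindings works as a smoke test. A minimal sketch, assuming a GGUF model is available at the placeholder path below; on the broken build this call crashed with the misaligned-address error:

import llama_cpp

llm = llama_cpp.Llama(
    model_path="/models/model.gguf",  # placeholder path to any GGUF model
    n_gpu_layers=-1,                  # offload all layers to the GPU
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])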