abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Loading Multiple Models at Once on Separate Devices, CUDA error 400 #974

Open Pathos14489 opened 10 months ago

Pathos14489 commented 10 months ago

Expected Behavior

I wanted to load two models at once to swap between them based on speed or quality. Here's the related code:

from llama_cpp import Llama

total_threads = 16
n_threads = int(total_threads/2)

fast_path = "./Synthia-7B-v1.3.q5_k_m.gguf"
smart_path = "./mythalion-13b.Q4_K_M.gguf"

print("Loading Fast Llama2")
llama = Llama(
    model_path=fast_path,
    n_gpu_layers=999,
    n_ctx=5120,
    n_batch=1024,
    n_threads=n_threads,
    # verbose=True,
    tensor_split=[1.0,0.0],
    main_gpu=0,
)
print("TESTING MODELS:")
test_1 = llama.create_completion("test", max_tokens=10) # This functions as expected
print("Test:",test_1)

print("Loading Smart Llama2")
smart_llama = Llama(
    model_path=smart_path,
    n_gpu_layers=999,
    n_ctx=5120,
    n_batch=1024,
    n_threads=n_threads,
    # verbose=True,
    tensor_split=[0.0,1.0],
    main_gpu=1,
)
print("TESTING MODELS:")
test_2 = llama.create_completion("test", max_tokens=10) # Fails here
test_3 = smart_llama.create_completion("test", max_tokens=10)
print("Test 2:",test_2)
print("Test 3:",test_3)

When this did not work, I also tried the following edit, which similarly did not work:

import os  # needed for os.environ below

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print("Loading Fast Llama2")
llama = Llama(
    model_path=fast_path,
    n_gpu_layers=999,
    n_ctx=5120,
    n_batch=1024,
    n_threads=n_threads,
    # verbose=True,
    # tensor_split=[1.0,0.0],
    # main_gpu=0,
)
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print("Loading Smart Llama2")
smart_llama = Llama(
    model_path=smart_path,
    n_gpu_layers=999,
    n_ctx=5120,
    n_batch=1024,
    n_threads=n_threads,
    # verbose=True,
    # tensor_split=[0.0,1.0],
    # main_gpu=1,
)

Current Behavior

Crashes with the following message:

CUDA error 400 at /tmp/pip-install-l820kql8/llama-cpp-python_b22d62cce7f540a0ae17b83dd03f27d3/vendor/llama.cpp/ggml-cuda.cu:7308: invalid resource handle
current device: 1

Environment and Context

Ryzen 9 5950X, 64 GB of DDR4 2333 MHz

$ lscpu ->
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5950X 16-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 5083.3979
CPU min MHz: 2200.0000
BogoMIPS: 6800.56
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization: AMD-V
Caches (sum of all):
  L1d: 512 KiB (16 instances)
  L1i: 512 KiB (16 instances)
  L2: 8 MiB (16 instances)
  L3: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds: Not affected
  Tsx async abort: Not affected

$ uname -a -> Linux pathos-mint 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ python3 --version -> Python 3.10.6
$ make --version ->
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version -> g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
abetlen commented 10 months ago

@Pathos14489 did the first version also crash? I don't think changing CUDA_VISIBLE_DEVICES works, or at least I'm not sure what the behaviour would be with respect to llama.cpp's CUDA internals.

Pathos14489 commented 9 months ago

Ah, sorry, I didn't notice the response. Yes, both versions crashed.

Edit: By the way, CUDA_VISIBLE_DEVICES used that way does work as intended when only one model is being loaded.
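That is consistent with the variable being read only once per process: with a single model, it is set before the first (and only) CUDA initialization. A small pure-Python sketch, no GPU needed, showing that the value is fixed in a child's environment at launch, which is why per-process isolation works but flipping the variable mid-process does not:

```python
import os
import subprocess
import sys

# Launch a child with CUDA_VISIBLE_DEVICES pinned in its environment;
# everything the child initializes (e.g. a Llama model) would see only
# that device. The parent's own environment is left untouched.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="1")
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env, capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # → 1
```

The equivalent from a shell is launching each script as `CUDA_VISIBLE_DEVICES=N python script.py`, one process per GPU.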

EverThingy commented 9 months ago

Have you managed to get the script working?