linuxmagic-mp opened this issue 1 year ago
Please put command output between triple backticks ``` like this ``` so it's easier to read.
You didn't state which GPU you use.
From what I see here, you possibly have a GPU that is too old for CUDA compilation. sm_30 GPUs seem to have been dropped with CUDA 9, and we are at 12.1 now; that's 7 generations behind.
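If it helps, here is a quick way to confirm which GPU and compute capability the driver actually reports (a sketch; the `compute_cap` query field only exists on fairly recent drivers):
```
# List the GPUs the driver sees
nvidia-smi -L

# Print name and compute capability (an RTX 4090 should report 8.9;
# the compute_cap field is only available on recent drivers)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```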
```
cmake -DLLAMA_CUBLAS=0 ..
```
That should compile it as a pure CPU version. Given the Threadripper, you will want to experiment with various -t settings (from 4 to 16) to find the best performance.
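For example, something like this compares a few thread counts once the CPU-only build is done (a sketch; the binary location and model path are placeholders for your setup):
```
# Try a few thread counts and compare the eval time printed at the end of each run
for t in 4 8 12 16; do
  echo "=== -t $t ==="
  ./bin/falcon_main -t "$t" -m /path/to/ggml-model.bin -p "Hello" -n 32
done
```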
Yes, I guess that is important to include: a brand new Nvidia 4090 with 24 GB. Thanks for the tip on the -t settings, however I do want to use the GPU. Let me know what other information you might need. It works with other projects, though I usually use the 'make' method rather than 'cmake'.
I'm quite sure it's a local problem with the toolset setup or paths. I'm not a great fan of cmake, though I think it's better than "configure+make". Still, I have my share of troubles with it.
I just checked out a vanilla clone of the repo on Linux and it compiled fine using make as well as cmake. Look at the README here, it contains quite a few useful hints on CUDA/cuBLAS troubles: https://github.com/cmp-nct/ggllm.cpp. The paths especially can make a major difference. Also, restarting fresh helps (delete the build directory contents).
For make, all I needed to do was:
```
export LLAMA_CUBLAS=1
make
make falcon_main
```
Yes, 'make' works fine, other than the warnings I noted in the discussions and in the examples. But with so many cooks in the kitchen, I didn't want to do any clean-up pull requests.
You can try this before cmake:
```
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda/bin:$PATH"
```
There was also a CPATH or similarly named variable, if I recall right, which sets the CUDA compiler.
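For reference, these are likely the variables meant here (CUDACXX is what CMake reads to pick the CUDA compiler, CPATH is the compiler's header search path); the exact values are a sketch assuming a default /usr/local/cuda install:
```
# Tell CMake explicitly which nvcc to use
export CUDACXX=/usr/local/cuda/bin/nvcc

# Make the CUDA headers visible to the host compiler
export CPATH="/usr/local/cuda/include:$CPATH"
```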
I think the problem originates from this: "ptxas -arch=sm_30". sm_30 is not a 4090; that's more like a GeForce 700.
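If CMake keeps probing sm_30, the target architecture can also be pinned explicitly; a minimal sketch for a 4090 (compute capability 8.9), assuming a fresh build directory:
```
rm -rf build && mkdir build && cd build
# Skip CMake's architecture guess and target sm_89 directly
cmake -DLLAMA_CUBLAS=1 -DCMAKE_CUDA_ARCHITECTURES=89 ..
cmake --build . --config Release
```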
Anyway: the binaries from "make" will work fine
Note: Using the standard 'make' method, I was able to safely convert the model with the use32 option and, apart from a few steps that weren't clear from the READMEs (e.g. having to run make falcon_convert and make falcon_main manually), I was able to successfully quantize it, including the output of the test (see below). However the output is fairly slow; need to see how we can speed things up.
```
./falcon_main -t 31 -m /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin -p "Love relates to hate like" -n 50
main: build = 770 (0eb3604)
main: seed = 1687543831
CUDA Device Summary - 1 devices found
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| Device | VRAM Total | VRAM Free | VRAM Used | Split at | Device ID |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| NVIDIA GeForce RTX 4090 | 24217 MB | 23640 MB | 576 MB | 0.0% | 0 (Main) |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
Total VRAM: 23.65 GB, Total available VRAM: 23.09 GB
--------------------
falcon.cpp: loading model from /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin
falcon.cpp: file version 4
falcon_model_load_internal: format = ggjt v3 (latest)
falcon_model_load_internal: n_vocab = 65024
falcon_model_load_internal: n_ctx = 512
falcon_model_load_internal: n_embd = 8192
falcon_model_load_internal: n_head = 128
falcon_model_load_internal: n_head_kv = 8
falcon_model_load_internal: n_layer = 60
falcon_model_load_internal: n_falcon_type = 40
falcon_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
falcon_model_load_internal: n_ff = 32768
falcon_model_load_internal: n_parts = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: INFO: using n_batch > 1 will require additional VRAM per device: 2818.00 MB
falcon_model_load_internal: VRAM free: 23246.00 MB of 24217.00 MB (in use: 970.00 MB)
falcon_model_load_internal: mem required = 26033.23 MB (+ 120.00 MB per state)
falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded 0.00 MB
falcon_model_load_internal: estimated VRAM usage: 2818 MB
[==================================================] 100% Tensors populated
falcon_model_load_internal: VRAM free: 23246.00 MB of 24217.00 MB (used: 970.00 MB)
falcon_init_from_file: kv self size = 120.00 MB
system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0
Love relates to hate like a magnet to iron.
The magnet can attract the iron but it can't make the iron stick to it.
Love attracts and draws people but it can't make people stay.
Love is unconditional but people aren't.
falcon_print_timings: load time = 1681.88 ms
falcon_print_timings: sample time = 38.98 ms / 50 runs ( 0.78 ms per token)
falcon_print_timings: prompt eval time = 450.49 ms / 5 tokens ( 90.10 ms per token)
falcon_print_timings: eval time = 15904.14 ms / 49 runs ( 324.57 ms per token)
falcon_print_timings: total time = 16408.97 ms
```
The README could use a full rework. Try: -b 1 -ngl 100
Also pull the latest changes from git; it looks like you are a couple of commits behind.
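Applied to the earlier command from the log, that would look roughly like this (a sketch, reusing only flags that appear in this thread; -ngl 100 simply asks for as many of the 60 layers as possible to be offloaded):
```
# Same run as before, but single-token batches and layer offloading enabled
./falcon_main -t 31 -b 1 -ngl 100 \
  -m /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin \
  -p "Love relates to hate like" -n 50
```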
If you hold many models, I recommend a short filename inside the model directory. I use just "q5_k", for example. That makes switching between models easier, without much typing.
That helped. I will close this thread for now. I will update or create a new ticket with the warnings (a lot more after updating to the latest ;) ), and I see the README is updated, so I will take a read. New performance numbers are a lot faster; the output is a little 'chunked', but...
```
User: Thank you, Bob. You are the
falcon_print_timings: load time = 2167.10 ms
falcon_print_timings: sample time = 3.51 ms / 91 runs ( 0.04 ms per token)
falcon_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
falcon_print_timings: eval time = 26493.90 ms / 186 runs ( 142.44 ms per token)
```
Regarding your 4090: with the latest updates you can squeeze a 40B q4_k into it with all 60/60 layers offloaded. Speed is 17-22 tokens/second now, so a multiple of your latest status.
To make it work on my 4090 I have to set the reserved MB to -500, which forces some stale VRAM into RAM; the card then operates at 0 bytes of free VRAM, but it works.
Interesting. I will have to wait a couple of days before I'm in front of the machine again, but we will have to compare notes. I'm just considering whether to use the Falcon 40B Instruct model now, or continue with performance testing on the base model. A tweetable moment, I would say. What's next on your list? You might have to put your own roadmap up. Tackle the longer 32k context? We should move this to a 'discussion' where you put up your system specs and your make/convert/quantize parameters, plus performance, and I can do the same?
Regarding Falcon Instruct: I am mostly using the foundation model; I'm not a big fan of the original instruct model. The Wizard instruct appears to be better, and I have yet to test the OpenAssistant variant. I dislike the "As an AI model" type responses; the original Falcon instruct is even worse.
Right now I am working on improving the tokenizer; it's not finding the optimal tokens. That might be a general problem or just related to the larger Falcon vocabulary. llama has only half the vocabulary and simpler tokens.
Originally I wanted to work on optimized quantization today, which will allow generating Falcon models that are optimized for the various hardware variants in use. Maybe there is still enough time remaining.
Then we have a couple paths to follow:
Prerequisites
Following the instructions in the README, on Linux Ubuntu Focal:
```
rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release
```
Expected Behavior
Simply trying to run make on a newly checked out version of ggllm.cpp; I was referred here from falcon.cpp. I know the README probably didn't get any love as it was moved to this fork.
```
-- Found CUDAToolkit: /usr/local/cuda-12.1/include (found version "12.1.105")
-- cuBLAS found
CMake Error at /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:751 (message):
  Compiling the CUDA compiler identification source file "CMakeCUDACompilerId.cu" failed.
  ......
  $ ptxas -arch=sm_30 -m64 "tmp/CMakeCUDACompilerId.ptx" -o "tmp/CMakeCUDACompilerId.sm_30.cubin"
  ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'
  --error 0xff --
...
  /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCUDACompiler.cmake:307 (CMAKE_DETERMINE_COMPILER_ID)
  CMakeLists.txt:238 (enable_language)
```
```
$ lscpu
Vendor ID:       AuthenticAMD
Model name:      AMD Ryzen Threadripper PRO 5955WX 16-Cores
Virtualization:  AMD-V
Flags:           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

commit 0eb3604c823658aa445957dfcfab81b9e51d4bad
```