cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Problem with CMake on Linux Focal, CUDA #22


linuxmagic-mp commented 1 year ago

Prerequisites

Following the instructions in the README, Linux Ubuntu Focal

rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release

Expected Behavior

Simply trying to run make on a newly checked-out version of ggllm.cpp. I was referred here from falcon.cpp, and I know the README probably didn't get any love when it was moved to this fork.

-- Found CUDAToolkit: /usr/local/cuda-12.1/include (found version "12.1.105")
-- cuBLAS found
CMake Error at /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:751 (message):
  Compiling the CUDA compiler identification source file "CMakeCUDACompilerId.cu" failed.
  ......

$ ptxas -arch=sm_30 -m64 "tmp/CMakeCUDACompilerId.ptx" -o "tmp/CMakeCUDACompilerId.sm_30.cubin"
ptxas fatal   : Value 'sm_30' is not defined for option 'gpu-name'
--error 0xff --
...
/usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCUDACompiler.cmake:307 (CMAKE_DETERMINE_COMPILER_ID)
CMakeLists.txt:238 (enable_language)

commit 0eb3604c823658aa445957dfcfab81b9e51d4bad

$ lscpu
Vendor ID:        AuthenticAMD
Model name:       AMD Ryzen Threadripper PRO 5955WX 16-Cores
Virtualization:   AMD-V
Flags:            fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

cebtenzzre commented 1 year ago

Please put command output between triple backticks ``` like this ``` so it's easier to read.

cmp-nct commented 1 year ago

You didn't state which GPU you use

From what I see here, you possibly have a GPU that is too old for CUDA compilation. SM_30 GPUs seem to have been dropped with CUDA 9, and we are at 12.1 now; that's seven generations behind.

cmake -DLLAMA_CUBLAS=0 ..

That should compile it as a pure CPU version. Given the Threadripper, you will want to experiment with various -t settings (from 4 to 16) to find the best performance.
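
For example, a minimal sketch of the CPU-only build plus a -t sweep; the model path is just a placeholder:

# CPU-only build in a fresh build directory
rm -rf build && mkdir build && cd build
cmake -DLLAMA_CUBLAS=0 ..
cmake --build . --config Release

# try a few thread counts and compare the reported eval times
# (binary location depends on the build method: ./falcon_main with make, bin/falcon_main is typical for cmake)
for t in 4 8 12 16; do
  ./bin/falcon_main -t $t -m /path/to/your/model.bin -p "Love relates to hate like" -n 50
done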

linuxmagic-mp commented 1 year ago

Yes, I guess that is important to include. Brand new NVIDIA 4090 24GB, and thanks for the tip on the -t settings; however, I do want to use the GPU. Let me know what other information you might need. It works with other projects, though I usually use the 'make' method rather than 'cmake'.

cmp-nct commented 1 year ago

I'm quite sure it's a local problem with the toolchain setup or paths. I'm not a great fan of cmake, though I think it's better than "configure+make". Still, I have my share of troubles with it.

I just checked out a vanilla clone of the repo on Linux and it compiled fine using make as well as cmake. Look at the readme here, it contains quite a few useful hints on CUDA/cuBLAS troubles: https://github.com/cmp-nct/ggllm.cpp Especially the paths can make a major difference. Also, starting fresh helps (delete the build directory contents).

For make, all I needed to do was:

export LLAMA_CUBLAS=1
make
make falcon_main

linuxmagic-mp commented 1 year ago

Yes, 'make' works fine, other than the warnings I noted in the discussions and in the examples. But with so many cooks in the kitchen, I didn't want to actually do any clean-up pull requests.

cmp-nct commented 1 year ago

You can try this before cmake:

export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda/bin:$PATH"

There was also a CPATH or PATHC variable, if I recall right, which sets the CUDA compiler.
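
As a sketch, a fresh configure with the CUDA paths exported first; CUDACXX is CMake's standard environment variable for selecting the CUDA compiler, though whether it is needed here is an assumption:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export CUDACXX=/usr/local/cuda/bin/nvcc      # assumption: point CMake at the intended nvcc

rm -rf build && mkdir build && cd build      # start fresh so CMake re-runs compiler detection
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release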

I think the problem originates from this: "ptxas -arch=sm_30". sm_30 is not a 4090; that's more like a GeForce 700-series card.
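
Along those lines, it can be worth checking whether an old nvcc/ptxas is being picked up from the PATH, and optionally pinning the architecture; whether this project's CMakeLists honors CMAKE_CUDA_ARCHITECTURES is an assumption on my part (the 4090 is compute capability 8.9):

# confirm the tools on the PATH really come from CUDA 12.1
which nvcc ptxas
nvcc --version
ptxas --version

# optionally pin the target architecture for the 4090 instead of letting CMake probe
cmake -DLLAMA_CUBLAS=1 -DCMAKE_CUDA_ARCHITECTURES=89 ..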

Anyway: the binaries from "make" will work fine

linuxmagic-mp commented 1 year ago

Note: Using the standard 'make' method, I was able to safely convert the model with the use32 option and, other than missing a few steps that weren't clear from the READMEs (e.g. manually having to run make falcon_convert and make falcon_main), I was able to successfully quantize it, including the output of the test (see below). However, the output is fairly slow. Need to see how we can speed things up.

./falcon_main -t 31 -m /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin -p "Love relates to hate like" -n 50 
main: build = 770 (0eb3604)
main: seed  = 1687543831

CUDA Device Summary - 1 devices found
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| Device                             | VRAM Total | VRAM Free | VRAM Used | Split at  | Device ID |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| NVIDIA GeForce RTX 4090            |   24217 MB |  23640 MB |    576 MB |      0.0% |  0 (Main) |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
Total VRAM: 23.65 GB, Total available VRAM: 23.09 GB
--------------------
falcon.cpp: loading model from /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin
falcon.cpp: file version 4
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: n_falcon_type      = 40
falcon_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: INFO: using n_batch > 1 will require additional VRAM per device: 2818.00 MB
falcon_model_load_internal: VRAM free: 23246.00 MB  of 24217.00 MB (in use:  970.00 MB)
falcon_model_load_internal: mem required  = 26033.23 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 0 of 60 layers to GPU, weights offloaded    0.00 MB
falcon_model_load_internal: estimated VRAM usage: 2818 MB
[==================================================] 100%  Tensors populated             
falcon_model_load_internal: VRAM free: 23246.00 MB  of 24217.00 MB (used:  970.00 MB)
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 31 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

Love relates to hate like a magnet to iron.
The magnet can attract the iron but it can't make the iron stick to it.
Love attracts and draws people but it can't make people stay.
Love is unconditional but people aren't.

falcon_print_timings:        load time =  1681.88 ms
falcon_print_timings:      sample time =    38.98 ms /    50 runs   (    0.78 ms per token)
falcon_print_timings: prompt eval time =   450.49 ms /     5 tokens (   90.10 ms per token)
falcon_print_timings:        eval time = 15904.14 ms /    49 runs   (  324.57 ms per token)
falcon_print_timings:       total time = 16408.97 ms
cmp-nct commented 1 year ago

The readme could use a full rework. Try: -b 1 -ngl 100
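
For example, based on the command you posted above, with only the suggested flags added:

./falcon_main -t 31 -m /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin \
  -p "Love relates to hate like" -n 50 -b 1 -ngl 100
# -ngl 100 asks for all layers to be offloaded (the model has 60),
# -b 1 avoids the extra ~2.8 GB of batch VRAM mentioned in your load log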

Also pull the latest changes from git; it looks like you are a couple of commits behind.

If you hold many models, I recommend a short filename inside the model directory. I use just "q5_k", for example. That makes switching between models easier without much typing.

linuxmagic-mp commented 1 year ago

That helped. I will close this thread for now. Will update or create a new ticket with the warnings; there are a lot more after updating to the latest ;) And I see the README is updated, so I will take a read. New performance numbers are a lot faster; the output is a little 'chunked', but...

User: Thank you, Bob. You are the

falcon_print_timings:        load time =  2167.10 ms
falcon_print_timings:      sample time =     3.51 ms /    91 runs   (    0.04 ms per token)
falcon_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
falcon_print_timings:        eval time = 26493.90 ms /   186 runs   (  142.44 ms per token)
cmp-nct commented 1 year ago

Regarding your 4090, you can squeeze a 40B q4_k fully offloaded (60/60 layers) into it with the latest updates. Speed is 17-22 tokens/second now, so a multiple of your latest numbers.

To make that work on my 4090 I have to set the reserved MB to -500, which forces some stale VRAM into RAM; the card then operates at 0 bytes of free VRAM, but it works.
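
A sketch of what that looks like from the user side, reusing only flags already shown in this thread (I'm deliberately not guessing the exact name of the VRAM-reserve option; check the binary's help output for it):

# 40B q4_k with all 60 layers offloaded to the 4090
./falcon_main -t 31 -m /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin \
  -p "Love relates to hate like" -n 50 -b 1 -ngl 60
# the "offloading ... of 60 layers to GPU" line in the startup log shows whether the
# full offload happened, and the CUDA Device Summary shows the remaining VRAM headroom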

linuxmagic-mp commented 1 year ago

Interesting, I will have to wait a couple of days before I'm in front of the machine again, but we will have to compare notes. Just considering whether to use the Falcon 40B Instruct model now, or continue with performance testing on the base model. A tweetable moment, though, I would say. What's next on your list? Might have to put your own roadmap up. Tackle longer 32k context? Should we move to 'discussion' and put up your system specs and your make/convert/quantize parameters, plus performance, and I can do the same?

cmp-nct commented 1 year ago

Regarding Falcon instruct: I am mostly using the foundation model; I'm not a big fan of the original instruct model. The Wizard instruct appears to be better, and I have yet to test the OpenAssistant variant. I dislike the "As an AI model" type responses, and the original Falcon instruct is even worse in that regard.

Right now I am working on improving the tokenizer; it's not finding the optimal tokens. That might be a general problem or just related to the larger Falcon vocabulary. LLaMA has only half the vocabulary and simpler tokens.

Originally I wanted to work on optimized quantization today, which will allow generating Falcon models that are optimized for the various hardware variants in use. Maybe there is still enough time remaining.

Then we have a couple of paths to follow: