mauridev777 opened 1 year ago
Using !./llama2_q4.cu llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs" gives the error: ./llama2_q4.cu: Permission denied
That command is trying to execute the CUDA source file (.cu), which is why the shell reports Permission denied. If you've compiled it using the instructions, the binary will be in build/llama2_q4.
You can probably just do !ln -s build/llama2_q4
and then the command will work as intended.
@GilesBathgate @ankan-ban cmake --build . --config Release
is giving me an error:
[ 25%] Building CXX object CMakeFiles/weight_packer.dir/weight_packer.cpp.o
/datasets/sisahu/llama_cu_awq/weight_packer.cpp: In function ‘int main(int, char**)’:
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:30: warning: ‘.input_layernorm.weight.bin’ directive writing 27 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
287 | sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:16: note: ‘sprintf’ output between 28 and 539 bytes into a destination of size 512
287 | sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
| ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:30: warning: ‘.post_attention_layernorm.we...’ directive writing 36 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
290 | sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:16: note: ‘sprintf’ output between 37 and 548 bytes into a destination of size 512
290 | sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
| ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 50%] Linking CXX executable weight_packer
[ 50%] Built target weight_packer
[ 75%] Building CUDA object CMakeFiles/llama2_q4.dir/llama2_q4.cu.o
/datasets/sisahu/llama_cu_awq/llama2_q4.cu(368): error: too few arguments in function call
1 error detected in the compilation of "/datasets/sisahu/llama_cu_awq/llama2_q4.cu".
make[2]: *** [CMakeFiles/llama2_q4.dir/build.make:76: CMakeFiles/llama2_q4.dir/llama2_q4.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:111: CMakeFiles/llama2_q4.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
Do you know what could be causing this error?
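Aside: the -Wformat-overflow warnings above are separate from the actual failure; they come from sprintf formatting into a fixed 512-byte filename buffer. A hedged sketch of the conventional fix, assuming filename is a local char array in weight_packer.cpp:

// Sketch only: bound the write with snprintf so a long fileNameBase
// truncates instead of overflowing (the diagnostic reports a 512-byte destination).
char filename[1024];
snprintf(filename, sizeof(filename), "%s.input_layernorm.weight.bin", fileNameBase);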
@iamsiddhantsahu I think you have an old CUDA toolkit, or the wrong version. You could try changing the line #define USE_CUDA_GRAPHS 1
to #define USE_CUDA_GRAPHS 0
in /datasets/sisahu/llama_cu_awq/llama2_q4.cu
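If you'd rather keep graphs enabled, the "too few arguments" error at llama2_q4.cu(368) is consistent with compiling the three-argument cudaGraphInstantiate call against a pre-12.0 CUDA toolkit, where the function took five arguments. A hedged sketch of both forms (that line 368 is this call is my assumption):

// CUDA 12.x signature -- the form the repo's code uses:
cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);

// CUDA 11.x signature -- five arguments, so the call above fails to
// compile with "too few arguments in function call" on older toolkits:
cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph,
                     nullptr /*pErrorNode*/, nullptr /*pLogBuffer*/, 0 /*bufferSize*/);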
@GilesBathgate Many thanks for the answer -- yes, you were right: setting #define USE_CUDA_GRAPHS 0
did make it compile.
Then I would assume this part of the code -- which seems to be the CUDA Graph API -- would no longer run. Does that mean my network would be less optimized?
// Pick the smallest power-of-two sequence-length bin (128, 256, ...) that fits seq_len.
int graphIndex;
int seq_len_bin = 128;
for (graphIndex = 0; graphIndex < MAX_GRAPHS - 1; seq_len_bin *= 2, graphIndex++)
    if (seq_len <= seq_len_bin) break;
if ((seq_len > seq_len_bin) || (graphIndex == MAX_GRAPHS - 1)) seq_len_bin = p->seq_len; // last bin holds max seq len

// Capture the whole network once per bin, then replay the captured graph on later tokens.
if (!graphCaptured[graphIndex])
{
    cudaGraph_t graph = {};
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    run_llama_network(s->pos, p, s, w, seq_len_bin);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);
    cudaGraphDestroy(graph);
    graphCaptured[graphIndex] = true;
}
cudaGraphLaunch(cudaGraphInstance[graphIndex], stream);
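Disabling USE_CUDA_GRAPHS doesn't change what the network computes; it only gives up the launch-overhead amortization of replaying a captured graph, so expect somewhat lower tok/s rather than different output. For readers unfamiliar with the capture-and-replay pattern, a minimal self-contained sketch with a hypothetical toy kernel (step_kernel is not from the repo):

#include <cuda_runtime.h>

__global__ void step_kernel(float* x) { x[threadIdx.x] += 1.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 32 * sizeof(float));
    cudaMemset(d_x, 0, 32 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of launches once...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 4; i++)
        step_kernel<<<1, 32, 0, stream>>>(d_x);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12.x signature
    cudaGraphDestroy(graph);

    // ...then replay it cheaply: one launch call per step instead of four.
    for (int step = 0; step < 100; step++)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaFree(d_x);
    cudaStreamDestroy(stream);
    return 0;
}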
Another thing: I wanted to incorporate FlashAttention into this code base and see if it gives me better tokens per second. Any thoughts on that? @GilesBathgate and @ankan-ban
@iamsiddhantsahu I don't know what improvement using cudaGraph is intended to achieve; I find the tok/s adequate without it. Regarding speed, the KV cache already speeds up inference. FlashAttention might add some gains but could take considerable effort to implement from scratch.
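For context on the KV-cache point: the cache stores each past token's key/value vectors, so a new token computes only its own K/V and attends over cached entries instead of recomputing the whole prefix every step. A minimal single-head sketch in plain C++ (the struct and layout are hypothetical, not the repo's):

#include <cmath>
#include <vector>

struct KVCache {
    int head_dim;
    std::vector<float> k, v;  // row t holds position t's key/value ([t][head_dim])

    // Per generated token: append its key/value once; they are never recomputed.
    void append(const float* kt, const float* vt) {
        k.insert(k.end(), kt, kt + head_dim);
        v.insert(v.end(), vt, vt + head_dim);
    }

    // Attention of the current query over every cached position.
    std::vector<float> attend(const float* q) const {
        int n = (int)(k.size() / head_dim);
        std::vector<float> scores(n), out(head_dim, 0.0f);
        float scale = 1.0f / std::sqrt((float)head_dim), maxs = -1e30f, sum = 0.0f;
        for (int t = 0; t < n; t++) {
            float s = 0.0f;
            for (int d = 0; d < head_dim; d++) s += q[d] * k[t * head_dim + d];
            scores[t] = s * scale;
            if (scores[t] > maxs) maxs = scores[t];
        }
        for (int t = 0; t < n; t++) { scores[t] = std::exp(scores[t] - maxs); sum += scores[t]; }
        for (int t = 0; t < n; t++)
            for (int d = 0; d < head_dim; d++)
                out[d] += (scores[t] / sum) * v[t * head_dim + d];
        return out;  // softmax(q.K^T/sqrt(d)) . V over positions 0..n-1
    }
};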
@GilesBathgate I just tried it on the NVIDIA Jetson Orin (64 GB) -- I am getting about 30 tokens/sec with this implementation. I also tried the GGML implementation https://github.com/ggerganov/llama.cpp -- it gives about 250-300 tokens/sec.
@iamsiddhantsahu I think Llama2_cu is simple and meant for educational purposes. See also https://github.com/karpathy/llm.c. GGML is better supported and has FlashAttention: https://github.com/ggerganov/llama.cpp/pull/5021
Hello, trying to run this in a Google Colab notebook, I get "./llama2_q4: No such file or directory" when running !./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs". Can this be used on Colab? Thanks