ankan-ban / llama_cu_awq

llama INT4 cuda inference with AWQ

./llama2_q4: No such file or directory #6

Open mauridev777 opened 1 year ago

mauridev777 commented 1 year ago

Hello, I'm trying to run this in a Google Colab notebook and I get "./llama2_q4: No such file or directory" when running: !./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs". Can this be used on Colab? Thanks

mauridev777 commented 1 year ago

Using !./llama2_q4.cu llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs" instead gives the error: ./llama2_q4.cu: Permission denied

GilesBathgate commented 9 months ago

If you've compiled it following the instructions, the binary will be in build/llama2_q4.

You can probably just do !ln -s build/llama2_q4 and then the command will work as written.

iamsiddhantsahu commented 2 months ago

@GilesBathgate @ankan-ban cmake --build . --config Release is giving me an error:

[ 25%] Building CXX object CMakeFiles/weight_packer.dir/weight_packer.cpp.o
/datasets/sisahu/llama_cu_awq/weight_packer.cpp: In function ‘int main(int, char**)’:
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:30: warning: ‘.input_layernorm.weight.bin’ directive writing 27 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  287 |         sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:287:16: note: ‘sprintf’ output between 28 and 539 bytes into a destination of size 512
  287 |         sprintf(filename, "%s.input_layernorm.weight.bin", fileNameBase);
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:30: warning: ‘.post_attention_layernorm.we...’ directive writing 36 bytes into a region of size between 1 and 512 [-Wformat-overflow=]
  290 |         sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/datasets/sisahu/llama_cu_awq/weight_packer.cpp:290:16: note: ‘sprintf’ output between 37 and 548 bytes into a destination of size 512
  290 |         sprintf(filename, "%s.post_attention_layernorm.weight.bin", fileNameBase);
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 50%] Linking CXX executable weight_packer
[ 50%] Built target weight_packer
[ 75%] Building CUDA object CMakeFiles/llama2_q4.dir/llama2_q4.cu.o
/datasets/sisahu/llama_cu_awq/llama2_q4.cu(368): error: too few arguments in function call

1 error detected in the compilation of "/datasets/sisahu/llama_cu_awq/llama2_q4.cu".
make[2]: *** [CMakeFiles/llama2_q4.dir/build.make:76: CMakeFiles/llama2_q4.dir/llama2_q4.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:111: CMakeFiles/llama2_q4.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Do you know what could be the error?
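(Side note: the -Wformat-overflow warnings near the top of the log are separate from the hard error and only warn that a very long fileNameBase could overflow the 512-byte filename buffer. A minimal way to silence them, assuming filename is that local buffer in weight_packer.cpp, would be to bound the writes, e.g.:)

    // Hypothetical fix for the -Wformat-overflow warnings only (not the CUDA compile error):
    // snprintf truncates instead of writing past the end of the destination buffer.
    snprintf(filename, sizeof(filename), "%s.input_layernorm.weight.bin", fileNameBase);
    snprintf(filename, sizeof(filename), "%s.post_attention_layernorm.weight.bin", fileNameBase);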

GilesBathgate commented 2 months ago

@iamsiddhantsahu I think you have an old CUDA driver, or the wrong version.

You could try changing the line #define USE_CUDA_GRAPHS 1 to #define USE_CUDA_GRAPHS 0 in /datasets/sisahu/llama_cu_awq/llama2_q4.cu.
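For reference, the "too few arguments in function call" error at llama2_q4.cu(368) would be consistent with an older CUDA toolkit: CUDA 12 accepts the three-argument cudaGraphInstantiate(&exec, graph, 0) form used in the graph-capture path, while CUDA 11.x expects the five-argument form. If that is indeed the failing call (an assumption), a version guard along these lines should let it compile with graphs still enabled:

    // Sketch: support both CUDA 11.x and CUDA 12+ signatures of cudaGraphInstantiate.
    // CUDART_VERSION is defined by the CUDA runtime headers.
    #if CUDART_VERSION >= 12000
        cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);                    // CUDA 12+: (exec, graph, flags)
    #else
        cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, nullptr, nullptr, 0);  // CUDA 11.x: (exec, graph, errNode, logBuf, bufSize)
    #endif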

iamsiddhantsahu commented 2 months ago

@GilesBathgate Many thanks for the answer -- yes indeed, you were right: setting #define USE_CUDA_GRAPHS 0 did make it compile.

Then I would assume this part of the code -- which appears to be the CUDA Graph API -- no longer runs. Does that mean my network will be less optimized?

    // pick a sequence-length bucket (128, 256, 512, ...); one graph is captured per bucket
    int graphIndex;
    int seq_len_bin = 128;
    for (graphIndex = 0; graphIndex < MAX_GRAPHS - 1; seq_len_bin *= 2, graphIndex++)
        if (seq_len <= seq_len_bin) break;
    if ((seq_len > seq_len_bin) || (graphIndex == MAX_GRAPHS - 1)) seq_len_bin = p->seq_len;    // last bin holds max seq len

    // capture the whole per-token network once for this bucket, then replay it on later calls
    if (!graphCaptured[graphIndex])
    {
        cudaGraph_t graph = {};
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_llama_network(s->pos, p, s, w, seq_len_bin);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&cudaGraphInstance[graphIndex], graph, 0);
        cudaGraphDestroy(graph);
        graphCaptured[graphIndex] = true;
    }
    cudaGraphLaunch(cudaGraphInstance[graphIndex], stream);

Another thing -- I wanted to incorporate FlashAttention into this code base and see if it gives me better tokens per second. Any thoughts on that? @GilesBathgate and @ankan-ban

GilesBathgate commented 2 months ago

@iamsiddhantsahu I don't know how much improvement using cudaGraph is intended to achieve; I find the tok/s adequate without it. Regarding speed, the KV-cache already speeds up inference. FlashAttention might add some gains, but it could take considerable effort to implement from scratch.
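To make the trade-off concrete: CUDA graphs mainly amortize per-kernel CPU launch overhead during token generation; with USE_CUDA_GRAPHS set to 0, the same network presumably just runs eagerly each step. A rough sketch of the difference, assuming run_llama_network has the signature shown in the snippet above (useCudaGraphs here is a hypothetical stand-in for the compile-time USE_CUDA_GRAPHS switch):

    // Hypothetical sketch: the per-token GPU work is identical either way; only the launch
    // mechanism differs, so disabling graphs mostly adds CPU kernel-launch overhead per step.
    if (useCudaGraphs)
        cudaGraphLaunch(cudaGraphInstance[graphIndex], stream);   // replay the pre-captured graph
    else
        run_llama_network(s->pos, p, s, w, seq_len_bin);          // launch each kernel eagerly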

iamsiddhantsahu commented 2 months ago

@GilesBathgate I just tried it on an NVIDIA Jetson Orin (64 GB) -- I am getting about 30 tokens/sec with this implementation -- and I also tried the GGML implementation https://github.com/ggerganov/llama.cpp, which gives about 250-300 tokens/sec.

GilesBathgate commented 2 months ago

@iamsiddhantsahu I think Llama2_cu is simple and intended for educational purposes. See also https://github.com/karpathy/llm.c. GGML is better supported and has FlashAttention: https://github.com/ggerganov/llama.cpp/pull/5021