ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Current state Llama3 & Mixtral 8x22b conversion #7001

Closed: PawelSzpyt closed this issue 4 months ago

PawelSzpyt commented 4 months ago

If I got it right, we should convert Llama3 with "convert-hf-to-gguf.py". This uses a ton of memory, and my Mac Studio M1 Ultra with 128GB of unified memory is unable to convert Llama3-70b to f32. Luckily it worked for f16 (although it hit swap very hard even for f16). I am unable to convert Mixtral 8x22b with this script at all (the process gets killed at part 38 out of 59). So I wanted to ask a few questions:

  1. convert-hf-to-gguf.py is the way to convert Llama3 for llama.cpp, and it uses over 160GB of memory (RAM + swap) to convert to f16, and about double that for f32 (which crashes my Mac Studio).
  2. Mixtral 8x22b also uses a BPE tokenizer and should be converted with convert-hf-to-gguf.py before quantizing it (roughly the flow sketched after this list). A Mac with 128GB of RAM is unable to do that; you currently need a machine with more RAM than that.
  3. Am I doing this right, or am I misunderstanding how it should work?
  4. Are there any plans for improvements? Or is this low priority, or am I using it incorrectly and all of the above is invalid?
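For reference, this is roughly the two-step flow I mean; the model path, output file names, and the Q4_K_M type below are just placeholders, not a recommendation:

    # 1) convert the Hugging Face checkpoint to a GGUF file (f16 here)
    python3 convert-hf-to-gguf.py /path/to/Mixtral-8x22B --outtype f16 --outfile mixtral-8x22b-f16.gguf

    # 2) quantize the f16 GGUF with the quantize tool built from llama.cpp
    ./quantize mixtral-8x22b-f16.gguf mixtral-8x22b-q4_k_m.gguf Q4_K_M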

Cheers, keep up the good work :)

ggerganov commented 4 months ago

I think if you add the --use-temp-file argument, it should work:

https://github.com/ggerganov/llama.cpp/blob/9c67c2773d4b706cf71d70ecf4aa180b62501960/convert-hf-to-gguf.py#L2937
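Something like this (model path and output name are placeholders); the flag makes the GGUF writer stage tensors in a temporary file instead of keeping them all in memory:

    python3 convert-hf-to-gguf.py /path/to/Mixtral-8x22B --outtype f16 --outfile mixtral-8x22b-f16.gguf --use-temp-file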

PawelSzpyt commented 4 months ago

True, I just successfully converted 8x22b with convert-hf-to-gguf.py and the --use-temp-file argument, thanks for the help. Closing the issue, as I assume this is the correct way to convert Mixtral and Llama3.