b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

Unknown header keys while converting llama 3 70b to distributed format #40

Open DifferentialityDevelopment opened 2 months ago

DifferentialityDevelopment commented 2 months ago

Hi there

I'm busy converting llama 3 70b to the distributed format, but I get the following output:

Target float type: q40
Target file: D:\Meta-Llama-3-70B-Instruct-Distributed\dllama_original_q40.bin
💿 Chunking model 1/16...
Unknown header key: ffn_dim_multiplier
Unknown header key: multiple_of
Unknown header key: norm_eps
Unknown header key: head_size
{'dim': 8192, 'ffn_dim_multiplier': 1.3, 'multiple_of': 4096, 'n_heads': 64, 'n_kv_heads': 8, 'n_layers': 80, 'norm_eps': 1e-05, 'vocab_size': 128256, 'rope_theta': 500000, 'head_size': 128.0, 'max_seq_len': 2048, 'arch_type': 11259136, 'n_experts': 0, 'n_active_experts': 0, 'hidden_dim': 28672}
🔶 Exporting tok_embeddings.weight torch.Size([16032, 65536])... Saved f32 tensor in 72.36s, 4202692608 bytes
🔶 Exporting layers.0.attention.wq.weight torch.Size([8192, 8192])... Saved q40 tensor in 15.90s, 37748736 bytes
🔶 Exporting layers.0.attention.wk.weight torch.Size([1024, 8192])... Saved q40 tensor in 1.99s, 4718592 bytes

Will it still work fine? The conversion process is really slow on my machine so far; it should be done in a couple of hours.
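A minimal, hypothetical sketch of where warnings like these typically come from (this is not the actual converter code; `KNOWN_HEADER_KEYS` and `build_header` are made-up names): keys in `params.json` that the converter does not map into its binary header are reported and skipped, while every field it does understand is still written out.

```python
# Hypothetical sketch, not the distributed-llama converter: illustrates how
# "Unknown header key" warnings can arise while the output stays complete.
import json

# Assumed set of fields the binary header cares about (illustrative only).
KNOWN_HEADER_KEYS = {
    'dim', 'n_heads', 'n_kv_heads', 'n_layers', 'vocab_size',
    'rope_theta', 'max_seq_len', 'arch_type', 'n_experts',
    'n_active_experts', 'hidden_dim',
}

def build_header(params_path: str) -> dict:
    with open(params_path) as f:
        params = json.load(f)
    header = {}
    for key, value in params.items():
        if key not in KNOWN_HEADER_KEYS:
            # Extra keys (e.g. ffn_dim_multiplier, multiple_of, norm_eps) are
            # only warned about and skipped; they don't go into the header.
            print(f'Unknown header key: {key}')
            continue
        header[key] = value
    return header
```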

b4rtaz commented 2 months ago

Hello @DifferentialityDevelopment, yes, it should be fine. The converter is slow; that part is not optimized at all yet.
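As a rough illustration of both points: the saved sizes in the log (37,748,736 bytes for the 8192×8192 `wq.weight`, i.e. 4.5 bits per weight) are consistent with a Q4_0-style layout of 32-weight blocks, each stored as a float16 scale plus 16 bytes of packed 4-bit values. The sketch below is not distributed-llama's converter; it is a simplified block quantizer under that assumed layout (`quantize_q40` is a made-up name), and its per-block Python loop is the kind of unoptimized work that makes a 70B conversion take hours.

```python
# Simplified Q4_0-style block quantization sketch (assumed layout:
# 32 weights per block = one float16 scale + 16 bytes of packed nibbles
# = 18 bytes/block, i.e. 4.5 bits per weight).
import numpy as np

BLOCK_SIZE = 32

def quantize_q40(weights: np.ndarray) -> bytes:
    # Assumes the tensor size is divisible by BLOCK_SIZE.
    flat = weights.astype(np.float32).reshape(-1, BLOCK_SIZE)
    out = bytearray()
    for block in flat:  # per-block Python loop: this is why it is slow
        amax = float(np.abs(block).max())
        scale = amax / 7.0 if amax > 0 else 1.0
        # Map values to 4-bit codes 0..15 (dequantized as (q - 8) * scale).
        q = (np.round(block / scale) + 8).clip(0, 15).astype(np.uint8)
        # Pack two 4-bit codes per byte.
        packed = q[:16] | (q[16:] << 4)
        out += np.float16(scale).tobytes() + packed.tobytes()
    return bytes(out)

# Size check against the log: 8192 * 8192 / 32 * 18 = 37,748,736 bytes,
# matching the reported size of layers.0.attention.wq.weight.
```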