Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
Unknown header keys while converting Llama 3 70B to distributed format #40
Hi there,
I'm converting Llama 3 70B to the distributed format, but I get the following output:
```
Target float type: q40
Target file: D:\Meta-Llama-3-70B-Instruct-Distributed\dllama_original_q40.bin
💿 Chunking model 1/16...
Unknown header key: ffn_dim_multiplier
Unknown header key: multiple_of
Unknown header key: norm_eps
Unknown header key: head_size
{'dim': 8192, 'ffn_dim_multiplier': 1.3, 'multiple_of': 4096, 'n_heads': 64, 'n_kv_heads': 8, 'n_layers': 80, 'norm_eps': 1e-05, 'vocab_size': 128256, 'rope_theta': 500000, 'head_size': 128.0, 'max_seq_len': 2048, 'arch_type': 11259136, 'n_experts': 0, 'n_active_experts': 0, 'hidden_dim': 28672}
🔶 Exporting tok_embeddings.weight torch.Size([16032, 65536])...
Saved f32 tensor in 72.36s, 4202692608 bytes
🔶 Exporting layers.0.attention.wq.weight torch.Size([8192, 8192])...
Saved q40 tensor in 15.90s, 37748736 bytes
🔶 Exporting layers.0.attention.wk.weight torch.Size([1024, 8192])...
Saved q40 tensor in 1.99s, 4718592 bytes
```
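For context, "Unknown header key" messages like these typically come from a converter iterating over the model's config dict and warning on fields it does not map into its own binary header. A minimal sketch of that pattern, assuming the converter simply reports and skips unrecognized keys (the key names are taken from the log above; this is illustrative, not the actual distributed-llama converter code):

```python
# Keys the (hypothetical) converter knows how to write into its header.
# Taken from the config printed in the log above; illustrative only.
KNOWN_KEYS = {
    'dim', 'n_heads', 'n_kv_heads', 'n_layers', 'vocab_size',
    'rope_theta', 'max_seq_len', 'arch_type', 'n_experts',
    'n_active_experts', 'hidden_dim',
}

def build_header(config: dict) -> dict:
    """Copy recognized config fields into the output header,
    warning on (and skipping) anything unrecognized."""
    header = {}
    for key, value in config.items():
        if key not in KNOWN_KEYS:
            # Unrecognized keys are reported but not fatal; the export
            # continues without them, as in the log above.
            print(f"Unknown header key: {key}")
            continue
        header[key] = value
    return header
```

Under this reading, the warnings for `ffn_dim_multiplier`, `multiple_of`, `norm_eps`, and `head_size` mean those fields are simply not carried into the distributed header, and the export proceeds regardless.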
Will the converted model still work fine despite these warnings? The conversion process is really slow on my machine so far; it should be done in a couple of hours.