b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Add safe tensor support to convert-llama.py #52

Closed DifferentialityDevelopment closed 1 month ago

DifferentialityDevelopment commented 1 month ago

I haven't updated the other model conversion scripts yet, but this allows you to convert any Llama model that uses safetensors.
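
For context, a minimal sketch of how tensors can be read from a safetensors file with the safetensors package; the shard name below is taken from the conversion log later in this thread and stands in for any shard:

    import safetensors

    # Open a shard lazily; tensor data is only read when requested by name.
    with safetensors.safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)  # returns a torch tensor (framework="pt")
            print(name, tuple(tensor.shape))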

b4rtaz commented 1 month ago

Please also update docs/LLAMA.md.

DifferentialityDevelopment commented 1 month ago

> Please also update docs/LLAMA.md.

I updated the usage a bit, though I could probably also mention that it works with the Hugging Face repo for Llama as well.

b4rtaz commented 1 month ago

@DifferentialityDevelopment I'm wondering about this part:

        with safetensors.safe_open(model_file, framework="pt") as f:
            for layer in f.keys():
                layers.append({
                    "name" : layer,
                    "file" : model_file
                })

Are you sure that the source model has all layers in the correct order that is expected by Distributed Llama?

DifferentialityDevelopment commented 1 month ago

> @DifferentialityDevelopment I'm wondering about this part:
>
>         with safetensors.safe_open(model_file, framework="pt") as f:
>             for layer in f.keys():
>                 layers.append({
>                     "name" : layer,
>                     "file" : model_file
>                 })
>
> Are you sure that the source model has all layers in the correct order that is expected by Distributed Llama?

Did not check yet; I will do a full convert of Llama 3 8B Instruct, test it with Distributed Llama, and report back.

DifferentialityDevelopment commented 1 month ago

The convert process itself does seem to work fine, but I will test once it finishes:

    python converter/convert-llama.py J:\Llama-3\Meta-Llama-3-8B-Instruct J:\Llama-3\Meta-Llama-3-8B-Instruct-Distributed q40
    Model name: Meta-Llama-3-8B-Instruct
    Target float type: q40
    Target file: dllama_meta-llama-3-8b-instruct_q40.bin
    Total layers: 291
    Total chunks: 7
    Unknown header key: head_size
    {'head_size': 128.0, 'n_layers': 32, 'n_heads': 32, 'n_kv_heads': 8, 'max_seq_len': 8192, 'rope_theta': 500000, 'arch_type': 11259136, 'n_experts': 0, 'n_active_experts': 0}
    💿 Chunking model 1/7...
    Loading tensors for model.embed_tokens.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.embed_tokens.weight torch.Size([128256, 4096])...
    Saved q40 tensor in 123.95s, 295501824 bytes
    Loading tensors for model.layers.0.input_layernorm.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.input_layernorm.weight torch.Size([4096])...
    Saved q40 tensor in 0.00s, 2304 bytes
    Loading tensors for model.layers.0.mlp.down_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.mlp.down_proj.weight torch.Size([4096, 14336])...
    Saved q40 tensor in 14.69s, 33030144 bytes
    Loading tensors for model.layers.0.mlp.gate_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.mlp.gate_proj.weight torch.Size([14336, 4096])...
    Saved q40 tensor in 14.96s, 33030144 bytes
    Loading tensors for model.layers.0.mlp.up_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.mlp.up_proj.weight torch.Size([14336, 4096])...
    Saved q40 tensor in 14.95s, 33030144 bytes
    Loading tensors for model.layers.0.post_attention_layernorm.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.post_attention_layernorm.weight torch.Size([4096])...
    Saved q40 tensor in 0.00s, 2304 bytes
    Loading tensors for model.layers.0.self_attn.k_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.self_attn.k_proj.weight torch.Size([1024, 4096])...
    Saved q40 tensor in 1.08s, 2359296 bytes
    Loading tensors for model.layers.0.self_attn.o_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.self_attn.o_proj.weight torch.Size([4096, 4096])...
    Saved q40 tensor in 4.37s, 9437184 bytes
    Loading tensors for model.layers.0.self_attn.q_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.self_attn.q_proj.weight torch.Size([4096, 4096])...
    Saved q40 tensor in 4.27s, 9437184 bytes
    Loading tensors for model.layers.0.self_attn.v_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.0.self_attn.v_proj.weight torch.Size([1024, 4096])...
    Saved q40 tensor in 1.05s, 2359296 bytes
    Loading tensors for model.layers.1.input_layernorm.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.1.input_layernorm.weight torch.Size([4096])...
    Saved q40 tensor in 0.00s, 2304 bytes
    Loading tensors for model.layers.1.mlp.down_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.1.mlp.down_proj.weight torch.Size([4096, 14336])...
    Saved q40 tensor in 14.91s, 33030144 bytes
    Loading tensors for model.layers.1.mlp.gate_proj.weight from: model-00001-of-00004.safetensors
    🔶 Exporting model.layers.1.mlp.gate_proj.weight torch.Size([14336, 4096])...
    Saved q40 tensor in 14.76s, 33030144 bytes

b4rtaz commented 1 month ago

Please also consider that some models may have a different layer order for some reason.

DifferentialityDevelopment commented 1 month ago

> Please also consider that some models may have a different layer order for some reason.

I would think the order of the keys when loading a .safetensors model is the same as in the .pth file, but I could be wrong; I will do a bit of research.
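
A minimal sketch of one way to check this, assuming the HF key names shown in the dump below and using a shard path from the conversion log above; it only tests whether the keys of a shard are already ordered by numeric layer index:

    import re
    import safetensors

    def layer_index(name):
        # Extract the numeric layer index from names like "model.layers.10.mlp.up_proj.weight".
        # Non-layer names (embedding, final norm, lm_head) map to -1 and sort first.
        m = re.search(r"model\.layers\.(\d+)\.", name)
        return int(m.group(1)) if m else -1

    with safetensors.safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
        keys = list(f.keys())

    indices = [layer_index(k) for k in keys]
    print("ordered by layer:", indices == sorted(indices))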

DifferentialityDevelopment commented 1 month ago

You're absolutely right, the layers are not necessarily in the right order; see the output of their keys below, where layer 9 only appears after layer 20. So I will need to fix the ordering (see the sorting sketch after the key list below). I'm also not entirely sure where to place lm_head.weight and model.norm.weight, which appear near the end of the list. The other thing I have trouble with is that I'm not sure which of the layers is the feed_forward layer, which is what the pth conversion uses to get the hidden_dim size.

Additionally they use a different naming convention, so I had to change a few more things. Is this correct?

    [safetensor] model.embed_tokens.weight -> [pth] tok_embeddings.weight
    [safetensor] model.layers.0.mlp.gate_proj.weight -> [pth] layers.0.feed_forward.w1.weight
    [safetensor] model.layers.0.mlp.up_proj.weight -> [pth] layers.0.feed_forward.w2.weight
    [safetensor] model.layers.0.post_attention_layernorm.weight -> [pth] layers.0.attention_norm.weight
    [safetensor] model.norm.weight -> [pth] norm.weight
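
On the hidden_dim question above: in the HF checkpoint the feed-forward block is the mlp.gate_proj / mlp.up_proj / mlp.down_proj weights, and the conversion log earlier shows gate_proj with shape [14336, 4096], so a hedged sketch for recovering hidden_dim from a tensor shape (shard name as in that log) could look like:

    import safetensors

    # gate_proj weight has shape [hidden_dim, dim]; the log above shows [14336, 4096].
    with safetensors.safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
        gate = f.get_tensor("model.layers.0.mlp.gate_proj.weight")

    hidden_dim, dim = gate.shape[0], gate.shape[1]
    print(hidden_dim, dim)  # expected: 14336 4096 for Llama 3 8B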

Keys:

    model.embed_tokens.weight => 128256
    model.layers.0.input_layernorm.weight => 4096
    model.layers.0.mlp.down_proj.weight => 4096
    model.layers.0.mlp.gate_proj.weight => 14336
    model.layers.0.mlp.up_proj.weight => 14336
    model.layers.0.post_attention_layernorm.weight => 4096
    model.layers.0.self_attn.k_proj.weight => 1024
    model.layers.0.self_attn.o_proj.weight => 4096
    model.layers.0.self_attn.q_proj.weight => 4096
    model.layers.0.self_attn.v_proj.weight => 1024
    model.layers.1.input_layernorm.weight => 4096
    model.layers.1.mlp.down_proj.weight => 4096
    model.layers.1.mlp.gate_proj.weight => 14336
    model.layers.1.mlp.up_proj.weight => 14336
    model.layers.1.post_attention_layernorm.weight => 4096
    model.layers.1.self_attn.k_proj.weight => 1024
    model.layers.1.self_attn.o_proj.weight => 4096
    model.layers.1.self_attn.q_proj.weight => 4096
    model.layers.1.self_attn.v_proj.weight => 1024
    model.layers.2.input_layernorm.weight => 4096
    model.layers.2.mlp.down_proj.weight => 4096
    model.layers.2.mlp.gate_proj.weight => 14336
    model.layers.2.mlp.up_proj.weight => 14336
    model.layers.2.post_attention_layernorm.weight => 4096
    model.layers.2.self_attn.k_proj.weight => 1024
    model.layers.2.self_attn.o_proj.weight => 4096
    model.layers.2.self_attn.q_proj.weight => 4096
    model.layers.2.self_attn.v_proj.weight => 1024
    model.layers.3.input_layernorm.weight => 4096
    model.layers.3.mlp.down_proj.weight => 4096
    model.layers.3.mlp.gate_proj.weight => 14336
    model.layers.3.mlp.up_proj.weight => 14336
    model.layers.3.post_attention_layernorm.weight => 4096
    model.layers.3.self_attn.k_proj.weight => 1024
    model.layers.3.self_attn.o_proj.weight => 4096
    model.layers.3.self_attn.q_proj.weight => 4096
    model.layers.3.self_attn.v_proj.weight => 1024
    model.layers.4.input_layernorm.weight => 4096
    model.layers.4.mlp.down_proj.weight => 4096
    model.layers.4.mlp.gate_proj.weight => 14336
    model.layers.4.mlp.up_proj.weight => 14336
    model.layers.4.post_attention_layernorm.weight => 4096
    model.layers.4.self_attn.k_proj.weight => 1024
    model.layers.4.self_attn.o_proj.weight => 4096
    model.layers.4.self_attn.q_proj.weight => 4096
    model.layers.4.self_attn.v_proj.weight => 1024
    model.layers.5.input_layernorm.weight => 4096
    model.layers.5.mlp.down_proj.weight => 4096
    model.layers.5.mlp.gate_proj.weight => 14336
    model.layers.5.mlp.up_proj.weight => 14336
    model.layers.5.post_attention_layernorm.weight => 4096
    model.layers.5.self_attn.k_proj.weight => 1024
    model.layers.5.self_attn.o_proj.weight => 4096
    model.layers.5.self_attn.q_proj.weight => 4096
    model.layers.5.self_attn.v_proj.weight => 1024
    model.layers.6.input_layernorm.weight => 4096
    model.layers.6.mlp.down_proj.weight => 4096
    model.layers.6.mlp.gate_proj.weight => 14336
    model.layers.6.mlp.up_proj.weight => 14336
    model.layers.6.post_attention_layernorm.weight => 4096
    model.layers.6.self_attn.k_proj.weight => 1024
    model.layers.6.self_attn.o_proj.weight => 4096
    model.layers.6.self_attn.q_proj.weight => 4096
    model.layers.6.self_attn.v_proj.weight => 1024
    model.layers.7.input_layernorm.weight => 4096
    model.layers.7.mlp.down_proj.weight => 4096
    model.layers.7.mlp.gate_proj.weight => 14336
    model.layers.7.mlp.up_proj.weight => 14336
    model.layers.7.post_attention_layernorm.weight => 4096
    model.layers.7.self_attn.k_proj.weight => 1024
    model.layers.7.self_attn.o_proj.weight => 4096
    model.layers.7.self_attn.q_proj.weight => 4096
    model.layers.7.self_attn.v_proj.weight => 1024
    model.layers.8.input_layernorm.weight => 4096
    model.layers.8.mlp.down_proj.weight => 4096
    model.layers.8.mlp.gate_proj.weight => 14336
    model.layers.8.mlp.up_proj.weight => 14336
    model.layers.8.post_attention_layernorm.weight => 4096
    model.layers.8.self_attn.k_proj.weight => 1024
    model.layers.8.self_attn.o_proj.weight => 4096
    model.layers.8.self_attn.q_proj.weight => 4096
    model.layers.8.self_attn.v_proj.weight => 1024
    model.layers.10.input_layernorm.weight => 4096
    model.layers.10.mlp.down_proj.weight => 4096
    model.layers.10.mlp.gate_proj.weight => 14336
    model.layers.10.mlp.up_proj.weight => 14336
    model.layers.10.post_attention_layernorm.weight => 4096
    model.layers.10.self_attn.k_proj.weight => 1024
    model.layers.10.self_attn.o_proj.weight => 4096
    model.layers.10.self_attn.q_proj.weight => 4096
    model.layers.10.self_attn.v_proj.weight => 1024
    model.layers.11.input_layernorm.weight => 4096
    model.layers.11.mlp.down_proj.weight => 4096
    model.layers.11.mlp.gate_proj.weight => 14336
    model.layers.11.mlp.up_proj.weight => 14336
    model.layers.11.post_attention_layernorm.weight => 4096
    model.layers.11.self_attn.k_proj.weight => 1024
    model.layers.11.self_attn.o_proj.weight => 4096
    model.layers.11.self_attn.q_proj.weight => 4096
    model.layers.11.self_attn.v_proj.weight => 1024
    model.layers.12.input_layernorm.weight => 4096
    model.layers.12.mlp.down_proj.weight => 4096
    model.layers.12.mlp.gate_proj.weight => 14336
    model.layers.12.mlp.up_proj.weight => 14336
    model.layers.12.post_attention_layernorm.weight => 4096
    model.layers.12.self_attn.k_proj.weight => 1024
    model.layers.12.self_attn.o_proj.weight => 4096
    model.layers.12.self_attn.q_proj.weight => 4096
    model.layers.12.self_attn.v_proj.weight => 1024
    model.layers.13.input_layernorm.weight => 4096
    model.layers.13.mlp.down_proj.weight => 4096
    model.layers.13.mlp.gate_proj.weight => 14336
    model.layers.13.mlp.up_proj.weight => 14336
    model.layers.13.post_attention_layernorm.weight => 4096
    model.layers.13.self_attn.k_proj.weight => 1024
    model.layers.13.self_attn.o_proj.weight => 4096
    model.layers.13.self_attn.q_proj.weight => 4096
    model.layers.13.self_attn.v_proj.weight => 1024
    model.layers.14.input_layernorm.weight => 4096
    model.layers.14.mlp.down_proj.weight => 4096
    model.layers.14.mlp.gate_proj.weight => 14336
    model.layers.14.mlp.up_proj.weight => 14336
    model.layers.14.post_attention_layernorm.weight => 4096
    model.layers.14.self_attn.k_proj.weight => 1024
    model.layers.14.self_attn.o_proj.weight => 4096
    model.layers.14.self_attn.q_proj.weight => 4096
    model.layers.14.self_attn.v_proj.weight => 1024
    model.layers.15.input_layernorm.weight => 4096
    model.layers.15.mlp.down_proj.weight => 4096
    model.layers.15.mlp.gate_proj.weight => 14336
    model.layers.15.mlp.up_proj.weight => 14336
    model.layers.15.post_attention_layernorm.weight => 4096
    model.layers.15.self_attn.k_proj.weight => 1024
    model.layers.15.self_attn.o_proj.weight => 4096
    model.layers.15.self_attn.q_proj.weight => 4096
    model.layers.15.self_attn.v_proj.weight => 1024
    model.layers.16.input_layernorm.weight => 4096
    model.layers.16.mlp.down_proj.weight => 4096
    model.layers.16.mlp.gate_proj.weight => 14336
    model.layers.16.mlp.up_proj.weight => 14336
    model.layers.16.post_attention_layernorm.weight => 4096
    model.layers.16.self_attn.k_proj.weight => 1024
    model.layers.16.self_attn.o_proj.weight => 4096
    model.layers.16.self_attn.q_proj.weight => 4096
    model.layers.16.self_attn.v_proj.weight => 1024
    model.layers.17.input_layernorm.weight => 4096
    model.layers.17.mlp.down_proj.weight => 4096
    model.layers.17.mlp.gate_proj.weight => 14336
    model.layers.17.mlp.up_proj.weight => 14336
    model.layers.17.post_attention_layernorm.weight => 4096
    model.layers.17.self_attn.k_proj.weight => 1024
    model.layers.17.self_attn.o_proj.weight => 4096
    model.layers.17.self_attn.q_proj.weight => 4096
    model.layers.17.self_attn.v_proj.weight => 1024
    model.layers.18.input_layernorm.weight => 4096
    model.layers.18.mlp.down_proj.weight => 4096
    model.layers.18.mlp.gate_proj.weight => 14336
    model.layers.18.mlp.up_proj.weight => 14336
    model.layers.18.post_attention_layernorm.weight => 4096
    model.layers.18.self_attn.k_proj.weight => 1024
    model.layers.18.self_attn.o_proj.weight => 4096
    model.layers.18.self_attn.q_proj.weight => 4096
    model.layers.18.self_attn.v_proj.weight => 1024
    model.layers.19.input_layernorm.weight => 4096
    model.layers.19.mlp.down_proj.weight => 4096
    model.layers.19.mlp.gate_proj.weight => 14336
    model.layers.19.mlp.up_proj.weight => 14336
    model.layers.19.post_attention_layernorm.weight => 4096
    model.layers.19.self_attn.k_proj.weight => 1024
    model.layers.19.self_attn.o_proj.weight => 4096
    model.layers.19.self_attn.q_proj.weight => 4096
    model.layers.19.self_attn.v_proj.weight => 1024
    model.layers.20.mlp.gate_proj.weight => 14336
    model.layers.20.self_attn.k_proj.weight => 1024
    model.layers.20.self_attn.o_proj.weight => 4096
    model.layers.20.self_attn.q_proj.weight => 4096
    model.layers.20.self_attn.v_proj.weight => 1024
    model.layers.9.input_layernorm.weight => 4096
    model.layers.9.mlp.down_proj.weight => 4096
    model.layers.9.mlp.gate_proj.weight => 14336
    model.layers.9.mlp.up_proj.weight => 14336
    model.layers.9.post_attention_layernorm.weight => 4096
    model.layers.9.self_attn.k_proj.weight => 1024
    model.layers.9.self_attn.o_proj.weight => 4096
    model.layers.9.self_attn.q_proj.weight => 4096
    model.layers.9.self_attn.v_proj.weight => 1024
    model.layers.20.input_layernorm.weight => 4096
    model.layers.20.mlp.down_proj.weight => 4096
    model.layers.20.mlp.up_proj.weight => 14336
    model.layers.20.post_attention_layernorm.weight => 4096
    model.layers.21.input_layernorm.weight => 4096
    model.layers.21.mlp.down_proj.weight => 4096
    model.layers.21.mlp.gate_proj.weight => 14336
    model.layers.21.mlp.up_proj.weight => 14336
    model.layers.21.post_attention_layernorm.weight => 4096
    model.layers.21.self_attn.k_proj.weight => 1024
    model.layers.21.self_attn.o_proj.weight => 4096
    model.layers.21.self_attn.q_proj.weight => 4096
    model.layers.21.self_attn.v_proj.weight => 1024
    model.layers.22.input_layernorm.weight => 4096
    model.layers.22.mlp.down_proj.weight => 4096
    model.layers.22.mlp.gate_proj.weight => 14336
    model.layers.22.mlp.up_proj.weight => 14336
    model.layers.22.post_attention_layernorm.weight => 4096
    model.layers.22.self_attn.k_proj.weight => 1024
    model.layers.22.self_attn.o_proj.weight => 4096
    model.layers.22.self_attn.q_proj.weight => 4096
    model.layers.22.self_attn.v_proj.weight => 1024
    model.layers.23.input_layernorm.weight => 4096
    model.layers.23.mlp.down_proj.weight => 4096
    model.layers.23.mlp.gate_proj.weight => 14336
    model.layers.23.mlp.up_proj.weight => 14336
    model.layers.23.post_attention_layernorm.weight => 4096
    model.layers.23.self_attn.k_proj.weight => 1024
    model.layers.23.self_attn.o_proj.weight => 4096
    model.layers.23.self_attn.q_proj.weight => 4096
    model.layers.23.self_attn.v_proj.weight => 1024
    model.layers.24.input_layernorm.weight => 4096
    model.layers.24.mlp.down_proj.weight => 4096
    model.layers.24.mlp.gate_proj.weight => 14336
    model.layers.24.mlp.up_proj.weight => 14336
    model.layers.24.post_attention_layernorm.weight => 4096
    model.layers.24.self_attn.k_proj.weight => 1024
    model.layers.24.self_attn.o_proj.weight => 4096
    model.layers.24.self_attn.q_proj.weight => 4096
    model.layers.24.self_attn.v_proj.weight => 1024
    model.layers.25.input_layernorm.weight => 4096
    model.layers.25.mlp.down_proj.weight => 4096
    model.layers.25.mlp.gate_proj.weight => 14336
    model.layers.25.mlp.up_proj.weight => 14336
    model.layers.25.post_attention_layernorm.weight => 4096
    model.layers.25.self_attn.k_proj.weight => 1024
    model.layers.25.self_attn.o_proj.weight => 4096
    model.layers.25.self_attn.q_proj.weight => 4096
    model.layers.25.self_attn.v_proj.weight => 1024
    model.layers.26.input_layernorm.weight => 4096
    model.layers.26.mlp.down_proj.weight => 4096
    model.layers.26.mlp.gate_proj.weight => 14336
    model.layers.26.mlp.up_proj.weight => 14336
    model.layers.26.post_attention_layernorm.weight => 4096
    model.layers.26.self_attn.k_proj.weight => 1024
    model.layers.26.self_attn.o_proj.weight => 4096
    model.layers.26.self_attn.q_proj.weight => 4096
    model.layers.26.self_attn.v_proj.weight => 1024
    model.layers.27.input_layernorm.weight => 4096
    model.layers.27.mlp.down_proj.weight => 4096
    model.layers.27.mlp.gate_proj.weight => 14336
    model.layers.27.mlp.up_proj.weight => 14336
    model.layers.27.post_attention_layernorm.weight => 4096
    model.layers.27.self_attn.k_proj.weight => 1024
    model.layers.27.self_attn.o_proj.weight => 4096
    model.layers.27.self_attn.q_proj.weight => 4096
    model.layers.27.self_attn.v_proj.weight => 1024
    model.layers.28.input_layernorm.weight => 4096
    model.layers.28.mlp.down_proj.weight => 4096
    model.layers.28.mlp.gate_proj.weight => 14336
    model.layers.28.mlp.up_proj.weight => 14336
    model.layers.28.post_attention_layernorm.weight => 4096
    model.layers.28.self_attn.k_proj.weight => 1024
    model.layers.28.self_attn.o_proj.weight => 4096
    model.layers.28.self_attn.q_proj.weight => 4096
    model.layers.28.self_attn.v_proj.weight => 1024
    model.layers.29.input_layernorm.weight => 4096
    model.layers.29.mlp.down_proj.weight => 4096
    model.layers.29.mlp.gate_proj.weight => 14336
    model.layers.29.mlp.up_proj.weight => 14336
    model.layers.29.post_attention_layernorm.weight => 4096
    model.layers.29.self_attn.k_proj.weight => 1024
    model.layers.29.self_attn.o_proj.weight => 4096
    model.layers.29.self_attn.q_proj.weight => 4096
    model.layers.29.self_attn.v_proj.weight => 1024
    model.layers.30.input_layernorm.weight => 4096
    model.layers.30.mlp.down_proj.weight => 4096
    model.layers.30.mlp.gate_proj.weight => 14336
    model.layers.30.mlp.up_proj.weight => 14336
    model.layers.30.post_attention_layernorm.weight => 4096
    model.layers.30.self_attn.k_proj.weight => 1024
    model.layers.30.self_attn.o_proj.weight => 4096
    model.layers.30.self_attn.q_proj.weight => 4096
    model.layers.30.self_attn.v_proj.weight => 1024
    model.layers.31.mlp.gate_proj.weight => 14336
    model.layers.31.mlp.up_proj.weight => 14336
    model.layers.31.self_attn.k_proj.weight => 1024
    model.layers.31.self_attn.o_proj.weight => 4096
    model.layers.31.self_attn.q_proj.weight => 4096
    model.layers.31.self_attn.v_proj.weight => 1024
    lm_head.weight => 128256
    model.layers.31.input_layernorm.weight => 4096
    model.layers.31.mlp.down_proj.weight => 4096
    model.layers.31.post_attention_layernorm.weight => 4096
    model.norm.weight => 4096
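
The sorting sketch mentioned above: one possible way (not necessarily the approach this PR ends up using) to restore a deterministic order after collecting the keys from every shard. Where lm_head.weight and model.norm.weight belong is still the open question, so the last bucket here is only an assumption:

    import re

    def sort_key(name):
        # Embedding first, per-layer tensors by numeric layer index, everything else last.
        if name == "model.embed_tokens.weight":
            return (0, 0, name)
        m = re.match(r"model\.layers\.(\d+)\.(.+)", name)
        if m:
            return (1, int(m.group(1)), m.group(2))
        return (2, 0, name)  # model.norm.weight, lm_head.weight, ... (placement is an assumption)

    # Example with a few of the out-of-order names from the dump above:
    names = [
        "model.layers.20.mlp.gate_proj.weight",
        "model.layers.9.input_layernorm.weight",
        "model.embed_tokens.weight",
        "model.norm.weight",
    ]
    print(sorted(names, key=sort_key))
    # The same key function could be applied to the layers list from the quoted snippet, e.g.
    # layers.sort(key=lambda layer: sort_key(layer["name"]))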

b4rtaz commented 1 month ago

I recommend using the same approach as in the convert_pth method: build a list of layer names, then pass it to the loop. BTW: this loop could be extracted from these two functions.
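
A minimal sketch of that suggestion, with a hypothetical helper name; the per-layer suffix order and the placement of lm_head.weight / model.norm.weight here are assumptions and should be taken from what convert_pth actually expects:

    def build_layer_names(n_layers):
        # Hypothetical helper: build an explicit, ordered list of tensor names instead of
        # trusting the order of keys() in the safetensors shards.
        names = ["model.embed_tokens.weight"]
        per_layer = [
            "input_layernorm.weight",
            "self_attn.q_proj.weight", "self_attn.k_proj.weight",
            "self_attn.v_proj.weight", "self_attn.o_proj.weight",
            "post_attention_layernorm.weight",
            "mlp.gate_proj.weight", "mlp.up_proj.weight", "mlp.down_proj.weight",
        ]
        for i in range(n_layers):
            names.extend(f"model.layers.{i}.{suffix}" for suffix in per_layer)
        names.extend(["model.norm.weight", "lm_head.weight"])
        return names

    # A shared export loop could then iterate over build_layer_names(n_layers) and look up
    # each tensor in whichever shard contains it.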

b4rtaz commented 1 month ago

@DifferentialityDevelopment I'm closing this pull request. The convert-hf.py script introduced in version 0.7.0 supports the safetensors format and 3 model types.