b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

fix: convert-llama.py supports different max_seq_len. #51

Closed · b4rtaz closed this 1 month ago

b4rtaz commented 1 month ago

The previous implementation of the Llama converter always set max_seq_len: 2048. Llama 3 has a longer context than Llama 2, so now it's required to update the params.json file manually before the conversion. Unfortunately, the Llama repository doesn't have this parameter defined in the default configuration.
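
A minimal sketch of the idea, not the project's actual converter code: read params.json and fall back to a default when max_seq_len is missing, which is why the official Llama 3 params.json needs the key added manually (the file path and default value here are assumptions for illustration).

```python
import json
import sys


def load_params(params_path: str, default_max_seq_len: int = 2048) -> dict:
    """Load params.json and ensure max_seq_len is present."""
    with open(params_path, "r") as f:
        params = json.load(f)
    # The official Llama 3 params.json doesn't define max_seq_len,
    # so either add it manually before conversion or rely on a default.
    params.setdefault("max_seq_len", default_max_seq_len)
    return params


if __name__ == "__main__":
    params = load_params(sys.argv[1])
    print(f"max_seq_len = {params['max_seq_len']}")
```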