Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License
fix: convert-llama.py supports different max_seq_len. #51
The previous implementation of the llama converter always set `max_seq_len: 2048`. Llama 3 has a longer context than Llama 2, so the `params.json` file now needs to be updated manually before the conversion. Unfortunately, the Llama repository doesn't define this parameter in the default configuration.