FasterDecoding / REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024
Apache License 2.0

Llama 3 8B is not supported #17

Open liranringel opened 1 month ago

liranringel commented 1 month ago

When I run:

RAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 -m rest.inference.cli --datastore-path datastore/datastore_chat_small.idx --base-model meta-llama/Meta-Llama-3-8B-Instruct

I get:

RAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 -m rest.inference.cli --datastore-path datastore/datastore_chat_small.idx --base-model meta-llama/Meta-Llama-3-8B-Instruct
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.47s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
USER: hey
ASSISTANT: Traceback (most recent call last):
  ...
  File "/home/liranringel/REST/rest/model/modeling_llama_kv.py", line 594, in forward
    key_states = past_key_value[0].cat(key_states, dim=2)
  File "/home/liranringel/REST/rest/model/kvcache.py", line 66, in cat
    dst.copy(tensor)
RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 1
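The mismatch of 32 vs. 8 at dimension 1 (the head dimension) is consistent with Llama 3 8B using grouped-query attention: it has 32 query heads but only 8 key/value heads, whereas Llama 2 7B has 32 of each. A KV cache allocated with the query-head count will reject the 8-head key/value tensors the model produces. The sketch below is a hypothetical pure-Python model of the shape rule that `cat` enforces, illustrating why a cache sized for 32 heads fails while one sized with the model's `num_key_value_heads` (8) succeeds; the helper `cat_kv` is illustrative and not part of the REST codebase.

```python
def cat_kv(cache_shape, new_shape, dim=2):
    """Mimic the shape rule a concat along `dim` enforces:
    every other dimension must match exactly.

    Shapes follow the (batch, num_kv_heads, seq_len, head_dim)
    layout seen in the traceback. Hypothetical helper for illustration.
    """
    for i, (a, b) in enumerate(zip(cache_shape, new_shape)):
        if i != dim and a != b:
            raise RuntimeError(
                f"The size of tensor a ({a}) must match the size of "
                f"tensor b ({b}) at non-singleton dimension {i}"
            )
    out = list(cache_shape)
    out[dim] += new_shape[dim]  # sequence length grows as keys are appended
    return tuple(out)


# Llama 3 8B attention layers emit key/value states with 8 KV heads.
new_keys = (1, 8, 16, 128)

# Cache sized with the query-head count (32), as a Llama-2-era
# implementation might do -- reproduces the reported error:
try:
    cat_kv((1, 32, 0, 128), new_keys)
except RuntimeError as e:
    print(e)  # size of tensor a (32) must match ... tensor b (8) ...

# Cache sized with num_key_value_heads (8) -- appending succeeds:
print(cat_kv((1, 8, 0, 128), new_keys))  # (1, 8, 16, 128)
```

If this diagnosis is right, the fix would be to size the cache in `kvcache.py` from the model config's `num_key_value_heads` rather than `num_attention_heads`.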

YudiZh commented 1 month ago

Have you encountered the problem of segmentation fault (core dumped) when using Llama-3-8B and running python3 get_datastore_chat.py --model-path Meta-Llama-3-8B-Instruct?