running unlimiformer inference on multiple gpus

kekekawaii2839 commented 1 year ago

Hi, after solving my problem with running summarization using llama-2-7b, I found a way to modify the code, and it finally works! Now I can load llama-2-13b-chat-hf and run inference on inputs of over 130k tokens!

Using the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 2 --datastore_device 3 --stream_output

I got the output:

=== GENERATED SEQUENCE 1 (input length: 131316) ===
|||  This is the transcription of the first 26 pages of the book "Harry Potter and the Philosopher's Stone" by J.K. Rowling. It is a faithful reproduction of the original text, with all the imperfections and idiosyncrasies of the author's writing style included. The text has not been edited or corrected in any way, as it is presented here in its original form. 

Please note that this transcription is for entertainment purposes only, and it is not intended to be a replacement for the original book. The original book is a work of fiction and any similarity to real persons, living or dead, is purely coincidental. 

Please enjoy this transcription for what it is, a reproduction of the original work, and please do not use it as a substitute for the original work.</s>

It's not a typical summary of Harry Potter :( But it's much better than the outputs of the 7b models!

I'll keep working on it and post updates as I find more.

kekekawaii2839 commented 1 year ago

Update: Command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 16 \
    --index_devices 2 --datastore_device 3 --stream_output --seed 114514

Output:

=== GENERATED SEQUENCE 1 (input length: 131315) ===
|||  This story is an excerpt from the book "Harry Potter and the Sorcerer's Stone" by J.K. Rowling. The story follows Harry Potter, a young boy who discovers he is a wizard, as he attends Hogwarts School of Witchcraft and Wizardry. In this excerpt, Harry and his friends Ron and Hermione, must face the challenges of their third year at Hogwarts, including a dangerous game of chess, a obstacle course, and a trip to the Forbidden Forest. Along the way, they must contend with the evil Lord Voldemort and his followers, and navigate the complexities of friendship and growing up. The excerpt ends with Harry, Ron, and Hermione on the Hogwarts Express, ready to head home for the summer.</s>

That's weird: the model's output seems to draw on plots from other Harry Potter books?

Command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python src/run_generation.py --model_type llama --model_name_or_path stabilityai/StableBeluga-13B \
    --prefix "### System:\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n### User:\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " ### Assistant" --test_unlimiformer --fp16 --length 200 --layer_begin 22 \
    --index_devices 2 --datastore_device 3 --stream_output

Output:

=== GENERATED SEQUENCE 1 (input length: 131310) ===
||| Harry Potter: The Philosopher's Stone Audiobook Part 3</s>

SharkWipf commented 1 year ago

> That's weird: the model's output seems to draw on plots from other Harry Potter books?

Yeah, using one of the world's most famous and well-documented book series as a test for a general-purpose LLM might not be the most reliable test :wink: It would be better to find something well known to you but not known to the base Llama model, so it can't cheat by pulling in information from its internal knowledge.

urialon commented 1 year ago

Yes, I agree with @SharkWipf. The best evaluation would be a book that you have read but that came out after Llama-2's knowledge cutoff.

@kekekawaii2839, can you share what modifications you made to the code and what problem they solved?

kekekawaii2839 commented 1 year ago

> Yeah, using one of the world's most famous and well-documented book series as a test for a general-purpose LLM might not be the most reliable test 😉 It would be better to find something well known to you but not known to the base Llama model, so it can't cheat by pulling in information from its internal knowledge.

Agreed. I randomly picked a book from Books3 and tested Llama-2 with the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book: " \
    --prompt example_inputs/1.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 2 --datastore_device 3 --stream_output --seed 8888

And here's the output:

=== GENERATED SEQUENCE 1 (input length: 114693) ===
||| This is a comprehensive list of all the PostgreSQL-related concepts and techniques that are covered in the provided documentation. It's a thorough resource for learning about PostgreSQL programming and administration.

The list includes the following:

1. About PostgreSQL, including its history, features, and community support.
2. Creating a PostgreSQL database and connecting to it.
3. Creating tables, including primary keys, foreign keys, and indexes.
4. Data types, including integer, boolean, date and time types, and geometry and geography types.
5. Querying the database using SELECT statements, including SELECT statements with joins and subqueries.
6. Data manipulation using INSERT, UPDATE, and DELETE statements.
7. Data modification using TRUNCATE statements.
8. Indexes and partitioning to optimize queries.
9. Streaming data into and out of PostgreSQL using file-based

(The summary is cut off by the maximum length limit.)

> @kekekawaii2839, can you share what modifications you made to the code and what problem they solved?

Sure. In my case, the problem was that I only have A100 40G GPUs, so I have no chance of running 13B or bigger models on a single GPU. After some googling, I found that Hugging Face's Accelerate can distribute the model across multiple GPUs. To do that, just add device_map='sequential' to the from_pretrained call at line 446 of run_generation.py:

model = model_class.from_pretrained(args.model_name_or_path, device_map='sequential', **model_kwargs)
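As a rough mental model (a toy sketch, not Accelerate's actual placement code), a sequential policy fills GPU 0 before touching GPU 1, and so on. The function name, layer names, sizes, and capacities below are all made up for illustration.

```python
# Toy illustration of a "sequential" placement policy: fill GPU 0 first,
# then GPU 1, and so on. All sizes are hypothetical, not measurements.

def sequential_device_map(layer_sizes_gb, gpu_capacity_gb):
    """Assign each layer, in order, to the current GPU until it is full."""
    device_map = {}
    device = 0
    used = 0.0
    for name, size in layer_sizes_gb.items():
        # Move on to the next GPU once the current one cannot hold this layer.
        while device < len(gpu_capacity_gb) and used + size > gpu_capacity_gb[device]:
            device += 1
            used = 0.0
        if device == len(gpu_capacity_gb):
            raise MemoryError(f"no GPU left with room for {name}")
        device_map[name] = device
        used += size
    return device_map


# Hypothetical model: 8 blocks of 5 GB each, on 4 GPUs with 16 GB usable each.
layers = {f"layer_{i}": 5.0 for i in range(8)}
placement = sequential_device_map(layers, gpu_capacity_gb=[16.0, 16.0, 16.0, 16.0])
print(placement)
```

With these made-up numbers the last GPU ends up empty, which is the general behaviour to expect from a fill-in-order policy when the model fits on fewer devices than are visible.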

Then I hit a series of errors caused by operations such as multiplying tensors that live on different GPUs. So I added lots of tensor1.to(tensor2.device) calls wherever the traceback pointed, and that solved it :)
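The pattern behind those fixes can be sketched as follows. This is a minimal illustration, not the actual patch to unlimiformer.py: the helper name and tensor shapes are hypothetical. One detail worth keeping in mind is that `Tensor.to()` returns a tensor rather than moving it in place, so the result has to be assigned.

```python
import torch


def matmul_same_device(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Multiply a @ b after moving `a` onto the device that holds `b`."""
    # .to() returns a tensor on the target device; it does not modify `a`
    # in place, so the result must be reassigned.
    a = a.to(b.device)
    return a @ b


# Use two different GPUs if at least two are visible, else fall back to CPU.
if torch.cuda.device_count() >= 2:
    dev_a, dev_b = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev_a = dev_b = torch.device("cpu")

x = torch.ones(2, 3, device=dev_a)
w = torch.ones(3, 4, device=dev_b)
y = matmul_same_device(x, w)
print(y.shape, y.device)
```

Without the `.to(b.device)` call, the multiplication would raise a device-mismatch RuntimeError whenever the two tensors sit on different GPUs.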

kekekawaii2839 commented 1 year ago

P.S. For now, when I run run_generation.py on multiple GPUs, there's a small chance that the process gets stuck at line 429 of unlimiformer.py:

concat_hidden_states.append(torch.cat(self.hidden_states[i], axis=1))

~~I really have no clue why, because it seems to happen randomly even when I don't change any input or flag.~~ Good news: it's not a problem with the code. Solved.

kekekawaii2839 commented 1 year ago

Update

Command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python src/run_generation.py --model_type llama --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
    --prefix "<<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n [INST] Summarize the following book and start with ‘This book tells a story about’ : " \
    --prompt example_inputs/harry_potter_full.txt \
    --suffix " [/INST]" --test_unlimiformer --fp16 --length 200 --layer_begin 20 \
    --index_devices 2 --datastore_device 3 --stream_output --seed 8848

Output:

=== GENERATED SEQUENCE 1 (input length: 131326) ===
|||  This is the end of the first book of the Harry Potter series. The story follows Harry Potter's first year at Hogwarts School of Witchcraft and Wizardry, where he learns about magic and the wizarding world. Along the way, he makes friends with Ron Weasley and Hermione Granger and has to face the challenges of the wizarding world, including the evil Lord Voldemort. 

The story concludes with Harry and his friends leaving Hogwarts for the summer holidays. Uncle Vernon and Aunt Petunia are there to take Harry home, but they do not understand the wizarding world and are rude to Harry. 

This first book introduces the main characters and the wizarding world, setting the stage for the subsequent books in the series. 

I hope this helps! Let me know if you have any other questions. </s>

Although the model didn't follow my instruction strictly, the summary is better. I also found that the --seed flag matters.
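One plausible reason the seed matters, assuming the script samples tokens rather than decoding greedily (this thread does not confirm that), is that the seed fixes the random draws. The toy sketch below uses only Python's standard library. The vocabulary, probabilities, and function name are invented for illustration and have nothing to do with the real model.

```python
import random


def sample_tokens(seed, vocab, weights, length=5):
    """Draw `length` tokens from a fixed distribution using a seeded RNG."""
    rng = random.Random(seed)  # private generator; the global RNG is untouched
    return rng.choices(vocab, weights=weights, k=length)


vocab = ["Harry", "Ron", "Hermione", "Hogwarts", "Voldemort"]  # toy vocabulary
weights = [0.4, 0.2, 0.2, 0.1, 0.1]  # toy next-token probabilities

run_a = sample_tokens(8848, vocab, weights)
run_b = sample_tokens(8848, vocab, weights)
run_c = sample_tokens(114514, vocab, weights)

print(run_a == run_b)  # True: the same seed reproduces the same draws
print(run_a, run_c)    # a different seed may produce a different sequence
```

Under that assumption, comparing prompts or flags is only fair when the seed is held fixed, or when results are averaged over several seeds.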

abertsch72 / unlimiformer

running unlimiformer inference on multiple gpus #29