IAAR-Shanghai / CRUD_RAG

CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models
https://arxiv.org/abs/2401.17043

Low Evaluation Speedup for Local Model #13

Closed mi-iro closed 4 months ago

mi-iro commented 4 months ago

Dear authors, when I ran the CRUD-RAG evaluation against GPT-3.5-Turbo with 20 threads, the test completed within half an hour.

2024-07-10 14:57:44.167 | INFO | main::56 - Namespace(model_name='gpt-3.5-turbo', temperature=0.1, max_new_tokens=1280, data_path='data/crud_split/split_merged.json', shuffle=True, embedding_name='sentence-transformers/bge-base-zh-v1.5', embedding_dim=768, docs_path='data/80000_docs', docs_type='txt', chunk_size=128, chunk_overlap=0, construct_index=False, add_index=False, collection_name='docs_80k_chuncksize_128_0', retrieve_top_k=8, retriever_name='base', quest_eval=False, bert_score_eval=False, task='continuing_writing', num_threads=20, show_progress_bar=True, contain_original_data=False) LLM is explicitly disabled. Using MockLLM. 3%|████▊ | 63/2000 [01:09<20:02, 1.61it/s

However, when using the local model Qwen7b, concurrent threading doesn't seem to make much of a difference: GPU utilization is very low, and it's unclear how long the test will take to finish :(
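This is the behavior one would expect if the local model serves requests one at a time while remote API calls merely wait on the network. A minimal illustration of that difference (the timings and functions below are hypothetical stand-ins, not the repository's actual local_model code):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def api_call(_):
    # An API request is I/O-bound: sleeping (like waiting on the network)
    # releases the GIL, so 10 threads overlap their waits.
    time.sleep(0.2)

_gpu_lock = threading.Lock()  # one in-process model instance => one request at a time

def local_model_call(_):
    # A single local model serializes generation, so extra threads just queue.
    with _gpu_lock:
        time.sleep(0.2)

def timed(fn, n=10, workers=10):
    """Run n calls of fn on a thread pool and return the elapsed wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fn, range(n)))
    return time.perf_counter() - start

print(f"API-style:   {timed(api_call):.2f}s")          # ~0.2s, waits overlap
print(f"local-style: {timed(local_model_call):.2f}s")  # ~2.0s, fully serialized
```

With 20 threads and one local model instance, throughput stays at roughly one request at a time, which matches the ~52 s/it observed above.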

2024-07-10 14:53:47.739 | INFO | main::56 - Namespace(model_name='qwen7b', temperature=0.1, max_new_tokens=1280, data_path='data/crud_split/split_merged.json', shuffle=True, embedding_name='sentence-transformers/bge-base-zh-v1.5', embedding_dim=768, docs_path='data/80000_docs', docs_type='txt', chunk_size=128, chunk_overlap=0, construct_index=False, add_index=False, collection_name='docs_80k_chuncksize_128_0', retrieve_top_k=8, retriever_name='base', quest_eval=False, bert_score_eval=False, task='continuing_writing', num_threads=20, show_progress_bar=True, contain_original_data=False) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 4/4 [00:10<00:00, 2.59s/it] LLM is explicitly disabled. Using MockLLM. 0%| | 0/2000 [00:00<?, ?it/s]

UPDATE: 1%|█ | 24/2000 [36:50<28:58:46, 52.80s/it]

$ nvidia-smi
Wed Jul 10 15:00:41 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H800 PCIe               Off | 00000000:0E:00.0 Off |                    0 |
| N/A   38C    P0              84W / 350W |  11902MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H800 PCIe               Off | 00000000:0F:00.0 Off |                    0 |
| N/A   39C    P0              81W / 350W |   8017MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H800 PCIe               Off | 00000000:10:00.0 Off |                    0 |
| N/A   37C    P0              75W / 350W |   8023MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H800 PCIe               Off | 00000000:12:00.0 Off |                    0 |
| N/A   38C    P0              81W / 350W |   8023MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H800 PCIe               Off | 00000000:87:00.0 Off |                    0 |
| N/A   38C    P0              80W / 350W |   8023MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H800 PCIe               Off | 00000000:88:00.0 Off |                    0 |
| N/A   38C    P0              76W / 350W |   7991MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H800 PCIe               Off | 00000000:89:00.0 Off |                    0 |
| N/A   39C    P0              81W / 350W |  10159MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H800 PCIe               Off | 00000000:8A:00.0 Off |                    0 |
| N/A   39C    P0              79W / 350W |    461MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    907595      C   python                                     8984MiB |
|    0   N/A  N/A    914610      C   python                                     2904MiB |
|    1   N/A  N/A    907595      C   python                                     8008MiB |
|    2   N/A  N/A    907595      C   python                                     8014MiB |
|    3   N/A  N/A    907595      C   python                                     8014MiB |
|    4   N/A  N/A    907595      C   python                                     8014MiB |
|    5   N/A  N/A    907595      C   python                                     7982MiB |
|    6   N/A  N/A    907595      C   python                                    10150MiB |
|    7   N/A  N/A    907595      C   python                                      452MiB |
+---------------------------------------------------------------------------------------+

haruhi-sudo commented 4 months ago

Hello. The local_model code in this repository exists only so that users can try the evaluation framework quickly; it does not support multi-GPU acceleration or other optimizations. We recommend serving Qwen and other local models behind an API and calling them through remote_model.
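One way to follow this recommendation (a sketch, not part of the repository) is to serve the model behind an OpenAI-compatible HTTP endpoint with vLLM, which batches concurrent requests on the GPU; the model name, tensor-parallel size, and port here are assumptions:

```shell
# Assumes vLLM is installed (pip install vllm).
# --tensor-parallel-size 8 shards the model across the 8 H800s shown above;
# for a 7B model a single GPU would also work.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen-7B-Chat \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --port 8000
```

The evaluation's 20 threads can then point remote_model at http://localhost:8000/v1, and the server batches the concurrent requests instead of serializing them in-process.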