ai-forever / MERA

MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for the Russian language for evaluating fundamental models.
MIT License
49 stars, 8 forks

Large models that do not fit on a single GPU are not parallelized across several GPUs #19

Closed (preduct0r closed 1 month ago)

preduct0r commented 1 month ago

Loading checkpoint shards: 0%| | 0/19 [00:00<?, ?it/s]
Loading checkpoint shards: 5%|▌ | 1/19 [00:01<00:24, 1.37s/it]
Loading checkpoint shards: 11%|█ | 2/19 [00:03<00:30, 1.80s/it]
Loading checkpoint shards: 16%|█▌ | 3/19 [00:06<00:37, 2.35s/it]
Loading checkpoint shards: 21%|██ | 4/19 [00:11<00:51, 3.41s/it]
Loading checkpoint shards: 26%|██▋ | 5/19 [00:17<00:58, 4.19s/it]
Loading checkpoint shards: 32%|███▏ | 6/19 [00:21<00:57, 4.38s/it]
Loading checkpoint shards: 37%|███▋ | 7/19 [00:27<00:58, 4.87s/it]
Loading checkpoint shards: 42%|████▏ | 8/19 [00:34<01:00, 5.51s/it]
Loading checkpoint shards: 47%|████▋ | 9/19 [00:38<00:51, 5.12s/it]
Loading checkpoint shards: 53%|█████▎ | 10/19 [00:44<00:47, 5.29s/it]
Loading checkpoint shards: 58%|█████▊ | 11/19 [00:51<00:47, 5.89s/it]
Loading checkpoint shards: 63%|██████▎ | 12/19 [00:54<00:34, 4.95s/it]
slurmstepd: error: JOB 2971874 ON sc34 CANCELLED AT 2024-05-12T23:49:39
slurmstepd: error: Detected 1 oom_kill event in StepId=2971874.batch. Some of the step tasks have been OOM Killed.

The situation is the same with 1, 2, or 3 A100 GPUs. Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

LSinev commented 1 month ago

Please describe in more detail how you ran it, so that we can reproduce the problem and try to debug it. Also, please test on the new branch: https://github.com/ai-forever/MERA/tree/update/new_harness_codebase

preduct0r commented 1 month ago

What helped was setting device_map="auto" and torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16 here: https://github.com/ai-forever/MERA/blob/main/lm-evaluation-harness/lm_eval/models/huggingface.py#L291
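
For anyone hitting the same problem, here is a minimal sketch of what those two settings look like in a standalone Hugging Face Transformers call. It is not the code from huggingface.py in MERA. The helper names choose_dtype_name and load_sharded are hypothetical, and the extra torch.cuda.is_available() guard is an addition not mentioned in the comment above. device_map="auto" needs the accelerate package installed.

```python
from typing import Any


def choose_dtype_name(bf16_supported: bool) -> str:
    """Pick bfloat16 when the GPU supports it, otherwise fall back to float16."""
    return "bfloat16" if bf16_supported else "float16"


def load_sharded(model_name: str) -> Any:
    """Hypothetical helper: load a causal LM spread across all visible GPUs."""
    # Heavy dependencies are imported lazily so the dtype helper stays lightweight.
    import torch
    from transformers import AutoModelForCausalLM

    # Assumption: guard with is_available() so the check is safe on CPU-only hosts.
    bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    dtype = getattr(torch, choose_dtype_name(bf16_ok))

    return AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # let accelerate place layers across the available GPUs
        torch_dtype=dtype,   # half precision: 2 bytes per parameter
    )
```

Usage would be something like load_sharded("mistralai/Mixtral-8x7B-Instruct-v0.1"). As far as I understand, without torch_dtype the weights load in float32 by default, which doubles the memory needed compared to bfloat16 or float16, so both settings likely matter here.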