InternLM / InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Clarification Needed on Utilization of Tokenization in the Fine-Tuning Module || InternLM-XComposer2d5 #431

Open · khyati2396 opened this issue 2 weeks ago

khyati2396 commented 2 weeks ago

Hello Fellow Developers,

I am working on implementing the evaluation code in the current fine-tuning module and noticed something regarding the tokenizer.

While the tokenizer is passed into the make_supervised_data_module function, it doesn't seem to be utilized in the DataCollatorForSupervisedDataset.

Since DataCollatorForSupervisedDataset serves as the custom data collator, if the tokenizer isn’t used there, what is being employed for tokenization? This brings up the concern of whether the fine-tuning script is functioning as intended.

Could you please clarify this?

> Also, when are you planning to release the evaluation code?

Thanks in Advance.

yuhangzang commented 2 weeks ago
  1. The tokenizer is defined in modeling_internlm_xcomposer2.py.

  2. You can use VLMEvalKit for evaluation (a minimal example is sketched below).
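
For reference, a minimal sketch of such an evaluation run is shown below. The model alias `XComposer2d5` and the benchmark `MMBench_DEV_EN` are assumptions used for illustration; check VLMEvalKit's supported model and dataset lists for the exact identifiers.

```bash
# Hypothetical sketch: evaluate InternLM-XComposer2.5 with VLMEvalKit.
# The model alias "XComposer2d5" and the dataset name are assumptions;
# see VLMEvalKit's README / vlmeval/config.py for the names it supports.
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

python run.py --data MMBench_DEV_EN --model XComposer2d5 --verbose
```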

khyati2396 commented 2 weeks ago

Thanks for the response, @yuhangzang. This makes sense.

I have a few more questions. What are the GPU requirements for full fine-tuning? Which parameters do I need to change for distributed (multi-GPU) fine-tuning?

I am unable to get multi-GPU training to work.

Case 1:

I tried LoRA fine-tuning on the sample dataset on a single A100. LoRA fine-tuning works on a single 80 GB A100 machine. The parameters I changed are below.

GPUS_PER_NODE=1   ## previous value was 8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

This works properly.

Case 2:

I have 8 x L4 GPUs (23 GB x 8 = 184 GB of GPU memory). I keep getting the error below whenever I set GPUS_PER_NODE to 1/2/3/4/5/6/7; NNODES stays 1.

[2024-08-27 11:37:21,292] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 200000000
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 200000000
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
rank0: Traceback (most recent call last):
rank0:   File "/home/karan/tasks_by_petpooja/internLM_xcomposer2_5/finetune/finetune.py", line 336, in <module>
rank0:   File "/home/karan/tasks_by_petpooja/internLM_xcomposer2_5/finetune/finetune.py", line 326, in train
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
rank0:     return inner_training_loop(
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/transformers/trainer.py", line 1682, in _inner_training_loop
rank0:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/accelerate/accelerator.py", line 1303, in prepare
rank0:     result = self._prepare_deepspeed(*args)
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
rank0:     engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/__init__.py", line 181, in initialize
rank0:     engine = DeepSpeedEngine(args=args,
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 306, in __init__
rank0:     self._configure_optimizer(optimizer, model_parameters)
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1250, in _configure_optimizer
rank0:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1508, in _configure_zero_optimizer
rank0:     optimizer = DeepSpeedZeroOptimizer(
rank0:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 393, in __init__
rank0:     weights_partition = self.parallel_partitioned_bit16_groups[i][partition_id].to(
rank0: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.13 GiB. GPU

What changes do I need to make for this to work?

yuhangzang commented 2 weeks ago

Our code is tested on 8 A100 (80 GB) GPUs. You may set a smaller value of hd_num to save GPU memory.
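
As a concrete illustration, one way to apply that suggestion is to pass a smaller `hd_num` when launching fine-tuning. This is only a sketch under assumptions: the flag spelling `--hd_num` and the value `9` are not confirmed here, so verify them against the argument definitions in `finetune.py` and the repository's launch scripts before use.

```bash
# Minimal sketch (assumptions flagged): lower hd_num to reduce the number of
# high-resolution image patches per sample, which lowers per-GPU memory.
# The flag name --hd_num and the value 9 are assumptions; check finetune.py
# for the exact argument name and its default.
GPUS_PER_NODE=8        # use all 8 L4 GPUs on the single node
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

torchrun --nproc_per_node=$GPUS_PER_NODE \
         --nnodes=$NNODES \
         --node_rank=$NODE_RANK \
         --master_addr=$MASTER_ADDR \
         --master_port=$MASTER_PORT \
         finetune.py \
         --hd_num 9 \
         "$@"          # keep the rest of the original script's arguments unchanged
```

If memory is still tight after reducing `hd_num`, a common general remedy for this kind of OOM (which the traceback shows occurring while ZeRO stage 2 partitions the bf16 weights) is a DeepSpeed ZeRO stage 3 config, optionally with CPU offload.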