FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Cannot finetune due to GPU OOM error #1218

Closed · dviettu134 closed this issue 5 days ago

dviettu134 commented 6 days ago

I got a GPU OOM error when trying to finetune the bge-m3 embedder model on Kaggle (2x T4 GPUs).

This is my run command (I already reduced query_max_len and passage_max_len):

!WANDB_DISABLED=True WANDB_MODE="disabled" torchrun --nproc_per_node 1 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path BAAI/bge-m3 \
    --cache_dir ./cache/model \
    --train_data /kaggle/input/bge-m3/postprocessed_training_bge_m3_data_score.jsonl \
    --cache_path ./cache/data \
    --train_group_size 6 \
    --query_max_len 64 \
    --passage_max_len 392 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation True \
    --same_dataset_within_batch True \
    --small_threshold 0 \
    --drop_threshold 0 \
    --output_dir ./test_encoder_only_m3_bge-m3_sd \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 8 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed /kaggle/working/FlagEmbedding/examples/finetune/ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type m3_kd_loss \
    --unified_finetuning False \
    --fix_encoder True \
    --use_self_distill True \
    --self_distill_start_step 0

I get an OOM error right at the start of training:

[2024-11-13 04:31:54,204] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-13 04:31:56,101] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-13 04:31:56,101] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Fetching 30 files: 100%|█████████████████████| 30/30 [00:00<00:00, 96866.14it/s]
Using /root/.cache/torch_extensions/py310_cu123 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu123/fused_adam/build.ninja...
/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.26943278312683105 seconds
[2024-11-13 04:32:01,219] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
  0%|                                                | 0/123824 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/finetune/embedder/encoder_only/m3/__main__.py", line 22, in <module>
[rank0]:     runner.run()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/abc/finetune/embedder/AbsRunner.py", line 119, in run
[rank0]:     self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2188, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank0]:     self.engine.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]:     self._take_model_step(lr_kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/fused_optimizer.py", line 286, in step
[rank0]:     _flatten_dense_tensors([
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 510, in _flatten_dense_tensors
[rank0]:     return torch._C._nn.flatten_dense_tensors(tensors)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.12 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.33 GiB is free. Process 8413 has 13.41 GiB memory in use. Of the allocated memory 10.61 GiB is allocated by PyTorch, and 2.10 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|          | 0/123824 [00:01<?, ?it/s]                                      
[rank0]:[W1113 04:32:03.108439288 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E1113 04:32:05.396000 137988924520256 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1153) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
FlagEmbedding.finetune.embedder.encoder_only.m3 FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-13_04:32:05
  host      : 21f7f035e960
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1153)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
hanhainebula commented 5 days ago

Hello, @dviettu134! You can set negatives_cross_device to False to save more GPU memory.
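For context, the snippet below is a minimal sketch (not FlagEmbedding's actual implementation) of the usual cross-device negatives pattern, which is why disabling it saves memory: each rank all-gathers the embeddings from every other rank, so the similarity matrix and its gradients grow with world size. The function names and the one-positive-per-query labeling are illustrative assumptions.

```python
# Sketch of cross-device in-batch negatives; assumes torch.distributed is
# already initialized (e.g. by torchrun) before these functions are called.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def gather_across_devices(t: torch.Tensor) -> torch.Tensor:
    """All-gather embeddings from every rank into one big pool of negatives."""
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t  # keep the local tensor so gradients still flow
    return torch.cat(gathered, dim=0)


def contrastive_loss(q, p, temperature=0.02, cross_device=False):
    # With cross_device=True the score matrix is (B*world_size, B*world_size)
    # instead of (B, B), which is the extra activation/gradient memory cost.
    if cross_device:
        q, p = gather_across_devices(q), gather_across_devices(p)
    scores = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)  # assume one positive per query
    return F.cross_entropy(scores, labels)
```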

hanhainebula commented 5 days ago

By the way, when you set fix_encoder to True, only the sparse and ColBERT linear heads will be finetuned. If you want to finetune all three retrieval methods (dense, sparse, and ColBERT), you should set fix_encoder to False.
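As a rough illustration of that behavior, here is a minimal sketch of what fix_encoder=True amounts to. The head shapes and variable names are illustrative assumptions, not FlagEmbedding's exact internals: the XLM-R backbone is frozen, and only the extra linear heads keep requires_grad=True.

```python
# Hedged sketch: freeze the dense encoder, leave sparse/ColBERT heads trainable.
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("BAAI/bge-m3")   # dense XLM-R encoder
hidden = backbone.config.hidden_size
sparse_head = nn.Linear(hidden, 1)                    # token-weight (sparse) head (assumed shape)
colbert_head = nn.Linear(hidden, hidden)              # multi-vector (ColBERT) head (assumed shape)

for p in backbone.parameters():                       # fix_encoder=True: no gradients for the encoder
    p.requires_grad = False

trainable = sum(p.numel()
                for p in list(sparse_head.parameters()) + list(colbert_head.parameters())
                if p.requires_grad)
print(f"trainable parameters with fix_encoder=True: {trainable:,}")
```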

dviettu134 commented 5 days ago

@hanhainebula thank you for your comments. I updated the command as you suggested, but the OOM error still occurs:

!WANDB_DISABLED=True WANDB_MODE="disabled" torchrun --nproc_per_node 2 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path BAAI/bge-m3 \
    --cache_dir ./cache/model \
    --train_data /kaggle/input/bge-m3-bidv/postprocessed_training_bge_m3_data_score.jsonl \
    --cache_path ./cache/data \
    --train_group_size 6 \
    --query_max_len 64 \
    --passage_max_len 392 \
    --pad_to_multiple_of 8 \
    --knowledge_distillation True \
    --same_dataset_within_batch True \
    --small_threshold 0 \
    --drop_threshold 0 \
    --output_dir ./test_encoder_only_m3_bge-m3_sd \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 8 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed /kaggle/working/FlagEmbedding/examples/finetune/ds_stage0.json \
    --logging_steps 1 \
    --save_steps 1000 \
    --negatives_cross_device False \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type m3_kd_loss \
    --unified_finetuning False \
    --fix_encoder False \
    --use_self_distill True \
    --self_distill_start_step 0
W1114 06:46:26.545000 136249064138560 torch/distributed/run.py:779] 
W1114 06:46:26.545000 136249064138560 torch/distributed/run.py:779] *****************************************
W1114 06:46:26.545000 136249064138560 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1114 06:46:26.545000 136249064138560 torch/distributed/run.py:779] *****************************************
[2024-11-14 06:46:36,055] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-14 06:46:36,056] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-14 06:46:38,376] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-14 06:46:38,376] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-11-14 06:46:38,395] [INFO] [comm.py:652:init_distributed] cdb=None
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Fetching 30 files: 100%|████████████████████| 30/30 [00:00<00:00, 129188.01it/s]
Fetching 30 files: 100%|█████████████████████| 30/30 [00:00<00:00, 91645.39it/s]
Using /root/.cache/torch_extensions/py310_cu123 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu123 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu123/fused_adam/build.ninja...
/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.3210275173187256 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.40282130241394043 seconds
[2024-11-14 06:46:44,158] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
[2024-11-14 06:46:44,239] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
  0%|                                                 | 0/61912 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
{'loss': 1.1245, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.0}            
  0%|                                      | 1/61912 [00:01<22:07:27,  1.29s/it][rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank1]:     return _run_code(code, main_globals, None,
[rank1]:   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
[rank1]:     exec(code, run_globals)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/finetune/embedder/encoder_only/m3/__main__.py", line 22, in <module>
[rank1]:     runner.run()
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/abc/finetune/embedder/AbsRunner.py", line 119, in run
[rank1]:     self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2188, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank1]:     self.engine.step()
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank1]:     self._take_model_step(lr_kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank1]:     self.optimizer.step()
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/fused_optimizer.py", line 286, in step
[rank1]:     _flatten_dense_tensors([
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 510, in _flatten_dense_tensors
[rank1]:     return torch._C._nn.flatten_dense_tensors(tensors)
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.12 GiB. GPU 1 has a total capacity of 14.74 GiB of which 1.80 GiB is free. Process 7131 has 12.94 GiB memory in use. Of the allocated memory 10.61 GiB is allocated by PyTorch, and 2.10 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/finetune/embedder/encoder_only/m3/__main__.py", line 22, in <module>
[rank0]:     runner.run()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/abc/finetune/embedder/AbsRunner.py", line 119, in run
[rank0]:     self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2188, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank0]:     self.engine.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]:     self._take_model_step(lr_kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/fp16/fused_optimizer.py", line 286, in step
[rank0]:     _flatten_dense_tensors([
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 510, in _flatten_dense_tensors
[rank0]:     return torch._C._nn.flatten_dense_tensors(tensors)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.12 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.80 GiB is free. Process 7130 has 12.94 GiB memory in use. Of the allocated memory 10.61 GiB is allocated by PyTorch, and 2.10 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|          | 1/61912 [00:03<52:11:35,  3.03s/it]                            
[rank0]:[W1114 06:46:47.765021971 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E1114 06:46:49.731000 136249064138560 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 725) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
FlagEmbedding.finetune.embedder.encoder_only.m3 FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-14_06:46:49
  host      : 0a4bb6d91554
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 726)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-14_06:46:49
  host      : 0a4bb6d91554
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 725)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
hanhainebula commented 5 days ago

Hello, @dviettu134! You can try using ds_stage1.json to save more GPU memory. If the OOM error still occurs, you can also set train_group_size to a smaller value.
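For readers without access to the repo's examples folder, the snippet below writes an illustrative ZeRO stage 1 config; the actual examples/finetune/ds_stage1.json may differ. Stage 1 partitions the optimizer states across GPUs, which shrinks the large flattened FP32 optimizer buffers that the fused_adam step was trying to allocate in the traceback above.

```python
# Illustrative DeepSpeed ZeRO stage 1 config for use with the HF Trainer
# integration ("auto" values are filled in from the TrainingArguments).
import json

ds_stage1 = {
    "zero_optimization": {"stage": 1},        # shard optimizer states across ranks
    "fp16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_clipping": "auto",
}

with open("ds_stage1.json", "w") as f:
    json.dump(ds_stage1, f, indent=2)
```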

dviettu134 commented 5 days ago

@hanhainebula thank you for your comment. I was able to run finetuning with ds_stage1.json. I am closing this issue.