QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

How to train on both image data and text-only data together? #346

Open dooinee-bom opened 3 days ago

dooinee-bom commented 3 days ago

Hi!

In your paper, you mentioned that including text-only data during training is crucial for maintaining language abilities. I'm currently doing full fine-tuning with LLaMA Factory and have run into an issue.

I'm trying to fully fine-tune Qwen2-VL in LLaMA Factory using the dataset_info.json below. The "aihub_charts" dataset contains both images and text, while "alignment_text_only_150_2" is a text-only dataset.

{
  "aihub_charts": {
    "file_name": "aihub_charts.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },
  "alignment_text_only_150_2": {
    "file_name": "alignment_text_only_150_test_2.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
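
For reference, the entries in the two data files follow LLaMA Factory's sharegpt layout as I understand it. The samples below are placeholders that only illustrate the structure, not my actual data.

An "aihub_charts" entry (image + text):

{
  "messages": [
    {"role": "user", "content": "<image>What trend does this chart show?"},
    {"role": "assistant", "content": "Sales rise steadily from 2019 to 2023."}
  ],
  "images": ["images/chart_0001.png"]
}

An "alignment_text_only_150_2" entry (text only, so no "images" field):

{
  "messages": [
    {"role": "user", "content": "Summarize the paragraph below in one sentence. ..."},
    {"role": "assistant", "content": "..."}
  ]
}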

The issue I'm facing is that when I run full fine-tuning with both datasets combined, training doesn't throw an error; it just halts unexpectedly.

Upon debugging, I found that the process gets stuck during module forwarding in torch/nn/modules/module.py, specifically at the following line: result = forward_call(*args, **kwargs)

The process stops at a certain point in the forward pass, as shown below (these are the logs I manually output):

Starting forward call for Qwen2VLForConditionalGeneration
Starting forward call for Embedding
Finished forward call for Embedding
Starting forward call for Qwen2VisionTransformerPretrainedModel
Starting forward call for PatchEmbed
Starting forward call for Conv3d
Finished forward call for Conv3d
Finished forward call for PatchEmbed
Starting forward call for VisionRotaryEmbedding
Finished forward call for VisionRotaryEmbedding
Starting forward call for Qwen2VLVisionBlock
Starting forward call for LayerNorm
Finished forward call for LayerNorm
Starting forward call for VisionSdpaAttention
Starting forward call for Linear
Finished forward call for Linear
Starting forward call for Linear
Finished forward call for Linear
Finished forward call for VisionSdpaAttention
Starting forward call for LayerNorm
Finished forward call for LayerNorm
Starting forward call for VisionMlp
Starting forward call for Linear
Finished forward call for Linear
Starting forward call for QuickGELUActivation
Finished forward call for QuickGELUActivation
Starting forward call for Linear
Finished forward call for Linear
Finished forward call for VisionMlp
Finished forward call for Qwen2VLVisionBlock
Starting forward call for Qwen2VLVisionBlock
... [Repeats for several blocks]
Starting forward call for VisionSdpaAttention
Starting forward call for Linear
Finished forward call for Linear
Starting forward call for Linear
Finished forward call for Linear

It seems to hang indefinitely during these forward passes without any error message.
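
As a side note, the same per-module trace can be produced without editing torch/nn/modules/module.py by registering forward hooks on every submodule. A minimal sketch (the helper name is mine, purely illustrative):

import torch.nn as nn

def add_trace_hooks(model: nn.Module):
    # Print when each submodule's forward starts and finishes; useful for
    # locating the exact module where a silent hang occurs.
    handles = []
    for _, module in model.named_modules():
        cls = module.__class__.__name__
        handles.append(module.register_forward_pre_hook(
            lambda m, inp, cls=cls: print(f"Starting forward call for {cls}", flush=True)))
        handles.append(module.register_forward_hook(
            lambda m, inp, out, cls=cls: print(f"Finished forward call for {cls}", flush=True)))
    return handles  # call handle.remove() on each one to detach the hooks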

An important observation: training proceeds normally when I run it on either "aihub_charts" or "alignment_text_only_150_2" individually.

How did you train on pure text data alongside image-text pair data? Did you pass dummy images for the text-only samples, or is there another approach? I'd appreciate any insight into how you incorporated text-only data into the training process.
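
For what it's worth, the workaround I'm currently considering (purely my own guess, not something confirmed in the paper or this repo) is to attach a small blank image to every text-only sample so that the vision tower runs on every rank at every step. My suspicion is that under DeepSpeed ZeRO-3, a rank that receives a text-only batch never calls the visual encoder, so the all-gathers for its partitioned parameters are never matched on that rank and everything hangs silently, which would fit the trace above. A rough sketch of patching the text-only file (the dummy image size, the <image> placeholder convention, and the output file name are my assumptions about the LLaMA Factory format):

import json
from PIL import Image

DUMMY_IMAGE = "dummy_white_336.png"  # placeholder path
Image.new("RGB", (336, 336), "white").save(DUMMY_IMAGE)

with open("alignment_text_only_150_test_2.json") as f:
    samples = json.load(f)

for sample in samples:
    # Prepend an <image> placeholder to the first user turn and attach the dummy
    # image, so every sample (and therefore every rank) exercises the vision tower.
    for msg in sample["messages"]:
        if msg["role"] == "user":
            msg["content"] = "<image>" + msg["content"]
            break
    sample["images"] = [DUMMY_IMAGE]

with open("alignment_text_only_150_with_dummy.json", "w") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)

The patched dataset would also need an "images": "images" entry under "columns" in dataset_info.json. I haven't verified whether the constant blank image skews the text-only objective, so I'd still prefer to know how you handled this.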

dooinee-bom commented 3 days ago

When switching from DeepSpeed ZeRO-3 to ZeRO-2, I encountered the following error messages:

[rank7]:[E1010 23:03:50.679013725 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=747, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
[rank7]:[E1010 23:03:50.679216212 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 747, last enqueued NCCL work: 747, last completed NCCL work: 746.
[rank6]:[E1010 23:03:50.744551430 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=747, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600074 milliseconds before timing out.
[rank6]:[E1010 23:03:50.744840039 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 747, last enqueued NCCL work: 747, last completed NCCL work: 746.
...
[rank7]:[E1010 23:03:51.824423351 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1010 23:03:51.824429498 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.

It seems like the issue may be related to data inconsistency across ranks. Has anyone faced similar challenges, or does anyone have insight into how to approach training under these circumstances? How did you structure your training to avoid these errors?
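
To narrow this down, I'm planning to log, per rank, whether each collated batch actually contains image features; if some ranks see pixel_values in a step while others don't, that would support the mixed-modality explanation. A rough debugging sketch (ModalityLoggingCollator and base_collator are my own names; pixel_values is the key I expect the Qwen2-VL processor to emit):

import os

class ModalityLoggingCollator:
    # Debug-only wrapper around an existing data collator: logs, per rank,
    # whether each collated batch contains image features.
    def __init__(self, base_collator):
        self.base_collator = base_collator

    def __call__(self, features):
        batch = self.base_collator(features)
        rank = os.environ.get("RANK", "0")  # also works inside DataLoader workers
        has_images = "pixel_values" in batch
        print(f"[rank {rank}] collated {len(features)} samples, has_images={has_images}", flush=True)
        return batch

As a stop-gap I may also raise ddp_timeout in the training arguments (if LLaMA Factory passes it through) so a slow step doesn't immediately kill the run, but that obviously wouldn't fix the root cause.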