OpenGVLab / VisionLLM

VisionLLM Series
https://arxiv.org/abs/2305.11175
Apache License 2.0

Question about the ablation #1

Open · Richar-Du opened this issue 1 year ago

Richar-Du commented 1 year ago

Thanks for your awesome work! VisionLLM opens a way towards a generalist vision and language model.

However, from the single-task vs. multi-task results in the ablation study, it seems that multi-task training hurts performance. What do you think causes this? Is the training data not large enough? OFA also introduces coordinate tokens and finds that multi-task learning can improve performance. Thanks in advance :)

czczup commented 1 year ago

Hi, thanks for this question, and apologies for the delayed response. Regarding the performance degradation observed in multi-task training, several factors could contribute to this result. First, we only used COCO data, which may not be enough. Second, multi-task training may require a longer training schedule to reach comparable performance. Third, sharing parameters across tasks suffers from the task-interference issue.

As described in UniPerceiver-MoE:

Compared to specialized models with specific parameters for each task, generalist models with shared parameters would suffer from the task-interference issue — different tasks with shared parameters may conflict with each other [88]. The same issue is also observed in multilingual NLP models [4, 81, 83]. We argue that the task-interference issue is mainly caused by the inconsistent optimization in multi-task learning. As shown in Tab. 1, during the training phase of generalist models, the gradient directions of different tasks would be inconsistent or even opposite. Thus, if multiple tasks share parameters, the optimal update direction of the shared parameters will be uncertain, resulting in sub-optimal performance.
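For intuition, here is a minimal, hypothetical PyTorch sketch (not taken from the VisionLLM or Uni-Perceiver-MoE code) of how such gradient conflict can be measured: it computes the gradients of two toy task losses with respect to a shared backbone and checks their cosine similarity. The model, heads, and data are placeholders chosen only to illustrate the idea.

```python
# Minimal sketch: measuring gradient conflict between two tasks that share parameters.
# All modules and tensors below are toy placeholders, not part of any released codebase.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny shared backbone standing in for the shared parameters of a generalist model.
shared = nn.Linear(16, 16)

# Two hypothetical task heads (e.g. detection vs. captioning) on top of the shared backbone.
head_a = nn.Linear(16, 4)
head_b = nn.Linear(16, 4)

x = torch.randn(8, 16)
target_a = torch.randn(8, 4)
target_b = torch.randn(8, 4)

def shared_grad(loss):
    """Return the gradient of `loss` w.r.t. the shared parameters as one flat vector."""
    grads = torch.autograd.grad(loss, shared.parameters(), retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

feat = shared(x)
loss_a = nn.functional.mse_loss(head_a(feat), target_a)
loss_b = nn.functional.mse_loss(head_b(feat), target_b)

g_a = shared_grad(loss_a)
g_b = shared_grad(loss_b)

# A cosine similarity below zero means the two tasks pull the shared parameters
# in conflicting directions, so the summed multi-task gradient is a compromise
# that can be sub-optimal for both tasks.
cos = nn.functional.cosine_similarity(g_a, g_b, dim=0)
print(f"cosine similarity of shared-parameter gradients: {cos.item():.3f}")
```

In practice, the lower (and more often negative) this similarity is across training steps, the stronger the task interference on the shared parameters.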

Richar-Du commented 1 year ago

OK, thanks for your reply :)