When training a new model, I got stuck in the middle of training

PKU-YuanGroup / Video-LLaVA

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

https://arxiv.org/pdf/2311.10122.pdf

Apache License 2.0

2.85k stars, 206 forks

When training a new model, I got stuck in the middle of training #107

Open: sunwhw opened 7 months ago

sunwhw commented 7 months ago

Hi, have you ever run into training that always gets stuck partway through after changing the model to support different sizes or different numbers of frames? I checked the logs, and it looks like DeepSpeed hit a communication problem during gradient reduce.
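A hang like this is usually easier to localize with more verbose distributed logging and a periodic stack dump from every rank. The following is a minimal sketch, assuming the run uses PyTorch with the NCCL backend under DeepSpeed. The helper names `set_debug_env` and `arm_stack_dumps` are hypothetical and not part of Video-LLaVA or DeepSpeed; `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are existing NCCL and PyTorch environment variables, and `faulthandler` is from the Python standard library.

```python
import faulthandler
import os
import sys


def set_debug_env() -> dict:
    """Hypothetical helper: request verbose NCCL / torch.distributed logs.

    Call it before the process group is initialized, since the variables
    are read at startup.
    """
    settings = {
        "NCCL_DEBUG": "INFO",                 # NCCL logs its communicator activity
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra consistency checks on collectives
    }
    for name, value in settings.items():
        # setdefault keeps any value already exported by the launch script.
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in settings}


def arm_stack_dumps(interval_s: int = 600) -> None:
    """Hypothetical helper: dump all Python thread stacks periodically.

    Every `interval_s` seconds each process writes its stacks to stderr,
    so on a stuck rank the repeated dumps show where it is blocked.
    """
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=sys.stderr)


if __name__ == "__main__":
    print(set_debug_env())
    arm_stack_dumps(interval_s=600)
    # ... start training here ...
    faulthandler.cancel_dump_traceback_later()
```

Comparing the stack dumps across ranks can show whether one rank is waiting in a collective while another is still in the forward pass or data loading. This only helps locate the hang; it does not identify the cause in this issue.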

LinB203 commented 7 months ago

What is your dataset? Is it a custom dataset?

sunwhw commented 7 months ago

Data: yes, it is custom, and this data trained normally before I modified the model to support a different number of frames. Model: I also cut the length of the data to 8, and it ran normally and finished successfully. So I think both the data and the model are fine, which leaves me confused about how to fix the bug.

ciroimmobile commented 3 months ago

Have you fixed this problem?