IVGSZ / Flash-VStream

This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
https://invinciblewyq.github.io/vstream-page/
Apache License 2.0

Results after stage 1 #12

Closed: leebebeto closed this issue 1 week ago

leebebeto commented 1 month ago

Hi, thank you for the wonderful work. I pretrained stage 1 using image-caption pairs only (LLaVA-filtered-558K).

Your paper says the overall training takes about 15 hours. Does that mean 15 hours in total for stage 1 and stage 2 combined, or 15 hours for each stage?


I used 4 A100 (80G) GPUs and found that training finishes very quickly: stage 1 took me only about two hours.


Also, I get loss values around 2.xx. Are these similar to what you observed, or should the model converge further?

Thank you in advance!

zhang9302002 commented 1 week ago

Hello! The overall training takes about 15 h in total: roughly 5 h for stage 1 and 10 h for stage 2 on our 8*A100 machine. It is reasonable that your pretraining finished quickly, because your run did not include video data. A loss around 2.xx is similar to our pretraining loss.
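
As a rough sanity check on the timing numbers above, a back-of-the-envelope estimate (assuming throughput scales roughly linearly with GPU count, which ignores data mix, batch size, and I/O differences) suggests the authors' ~5 h stage-1 run on 8 GPUs would take about 10 h on 4 GPUs, so the 2 h result is mainly explained by the missing video data rather than the GPU count:

```python
def scaled_hours(baseline_hours: float, baseline_gpus: int, target_gpus: int) -> float:
    """Estimate wall-clock hours when changing GPU count,
    assuming throughput scales linearly with the number of GPUs."""
    return baseline_hours * baseline_gpus / target_gpus

# Authors' stage-1 baseline: ~5 h on 8x A100 (image + video data).
# The same workload on 4 GPUs would take roughly twice as long:
print(scaled_hours(5.0, 8, 4))  # -> 10.0
```

Since the observed 2 h is far below this ~10 h estimate, the reduced dataset (image-caption pairs only) accounts for most of the difference.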