IVGSZ / Flash-VStream

This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
https://invinciblewyq.github.io/vstream-page/
Apache License 2.0

Results after stage 1 #12

Closed: leebebeto closed this issue 1 week ago

leebebeto commented 1 month ago

Hi, thank you for the wonderful work. I pretrained stage 1 using image-caption pairs only (LLaVA-filtered-558K).

Your paper says the overall training takes about 15 hours. Does that mean 15 hours in total for stage 1 and stage 2 combined, or 15 hours for each stage?


I used 4 A100 (80G) GPUs and found that training finishes very quickly: stage 1 took me only about two hours.


Also, I get loss values around 2.xx. Are these similar to what you observed, or should the model converge further?

Thank you in advance!

zhang9302002 commented 1 week ago

Hello! The overall training takes about 15 h in total: roughly 5 h for stage 1 and 10 h for stage 2 on our 8*A100 machine. It is reasonable that your pretraining finished quickly, because your run did not include video data. A loss around 2.xx is similar to our pretraining loss.
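
As a rough sanity check on the timing numbers above, a back-of-the-envelope estimate (assuming throughput scales roughly linearly with GPU count, which ignores data mix, batch size, and I/O differences) suggests the authors' ~5 h stage-1 run on 8 GPUs would take about 10 h on 4 GPUs, so the 2 h result is mainly explained by the missing video data rather than the GPU count:

```python
def scaled_hours(baseline_hours: float, baseline_gpus: int, target_gpus: int) -> float:
    """Estimate wall-clock hours when changing GPU count,
    assuming throughput scales linearly with the number of GPUs."""
    return baseline_hours * baseline_gpus / target_gpus

# Authors' stage-1 baseline: ~5 h on 8x A100 (image + video data).
# The same workload on 4 GPUs would take roughly twice as long:
print(scaled_hours(5.0, 8, 4))  # -> 10.0
```

Since the observed 2 h is far below this ~10 h estimate, the reduced dataset (image-caption pairs only) accounts for most of the difference.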