OpenGPT4o Multimodality Project

anonymous-atom commented 1 month ago

Sorry for any inconvenience caused by reaching out like this. I am a research student at Georgia Tech working on multimodal models, and I have been working with NExT-GPT for a long time, along with advisors from Reka.ai and NVIDIA.

Could you help us with some doubts regarding multimodal models here? That would mean a lot!

Your work on NExT-GPT is really nice.

Thanks again

ChocoWu commented 1 month ago

Sure! I’d be happy to help if I can. However, I’m not sure what your specific doubt is—could you clarify?

anonymous-atom commented 1 month ago

Thank you very much for that! I have several doubts regarding training and interleaved generation. I will post them in a while.

Would it be possible to collaborate over email or whichever channel you prefer? I just want to keep the work private until it is published.

My email: ksharma370@gatech.edu

Thanks again!

ChocoWu commented 1 month ago

my email: swu@u.nus.edu

anonymous-atom commented 1 month ago

Got it, thanks! I will keep posting updates and doubts here.

anonymous-atom commented 1 month ago

Hi @ChocoWu, let me know if I am wrong: do we have to train Stage 3 before Stage 2? If so, your latest code requires "multimodal_projector" checkpoints, but we only get those by training Stage 2 first.

ChocoWu commented 1 month ago

I did not understand your question. In the new version's codebase, there is no "multimodal_projector." Please refer to the latest README for details on training.

anonymous-atom commented 3 weeks ago

Hi @ChocoWu, really sorry to reach out again this way, but I have a doubt regarding training:

During training of the decoder, at every step some of compute_image_loss, compute_video_loss, and compute_audio_loss are None. Is this because the dataloader passes data from only one modality through each forward pass? Is this the expected behaviour?

I ask because the loss fluctuates a lot:

=== compute_image_loss : (30.875, 30.875, None) ======
=== compute_video_loss : (None, None, None) ======
===compute_audio_loss : (None, None, None) ======
{'loss': 38.7826, 'grad_norm': 47.2336196899414, 'learning_rate': 6.505283282093679e-06, 'epoch': 0.0}

=== compute_image_loss : (None, None, None) ======
=== compute_video_loss : (None, None, None) ======
===compute_audio_loss : (2.421875, 2.421875, 0.3259783685207367) ======
{'loss': 6.0065, 'grad_norm': 111.6504898071289, 'learning_rate': 6.1332575503138975e-06, 'epoch': 0.0}
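
For reference, here is a minimal sketch of how this pattern could arise, assuming each batch holds a single modality and the trainer adds up only the loss terms that are present. The function name `combine_losses` and the plain sum are hypothetical and not taken from the NExT-GPT code, and the numbers are made up for illustration:

```python
from typing import Optional


def combine_losses(text_loss: float, **modality_losses: Optional[float]) -> float:
    """Sum the text loss with every modality loss that is not None."""
    total = text_loss
    for name, value in modality_losses.items():
        if value is None:
            # No sample of this modality in the batch, so no term is added.
            continue
        total += value
    return total


# Image-only batch: video and audio contribute nothing.
image_step = combine_losses(1.0, image=30.0, video=None, audio=None)
# Audio-only batch: a much smaller term makes up the total.
audio_step = combine_losses(1.0, image=None, video=None, audio=2.5)
print(image_step, audio_step)
```

Under that assumption, the logged total would jump whenever consecutive batches come from modalities whose losses sit on different scales.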

anonymous-atom commented 3 weeks ago

@ChocoWu Also can you kindly release the latest NExT-GPT weights ?

ChocoWu commented 3 weeks ago

Hi, I'm sorry for the late response. I have been too busy these two weeks chasing a deadline.

Yes, this is expected behavior. You might also try mixing data from different modalities to see how it performs.
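
One way to try the suggested mixing, as a sketch only: pool the samples from all modalities and shuffle them before batching, so that a batch can contain more than one modality. The helper `mixed_batches` and the `(modality, sample)` tuple layout are assumptions for illustration, not the repository's dataloader:

```python
import random
from typing import Dict, Iterator, List, Tuple


def mixed_batches(
    datasets: Dict[str, List[str]], batch_size: int, seed: int = 0
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches drawn from a shuffled pool of all modalities."""
    # Tag every sample with its modality so the model side can route it.
    pool = [(name, sample) for name, samples in datasets.items() for sample in samples]
    # A seeded shuffle keeps the experiment reproducible.
    random.Random(seed).shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


toy = {
    "image": ["img0", "img1", "img2"],
    "video": ["vid0", "vid1"],
    "audio": ["aud0", "aud1", "aud2"],
}
batches = list(mixed_batches(toy, batch_size=4))
for batch in batches:
    print([name for name, _ in batch])
```

Whether the forward pass accepts several modalities in one batch is not confirmed in this thread; if it does not, alternating modalities across gradient-accumulation steps would be a comparable experiment.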

anonymous-atom commented 3 weeks ago

Thanks for your response. Wishing you luck with CVPR, if that's the deadline you are chasing!
