MCG-NKU / E2FGVI

Official code for "Towards An End-to-End Framework for Flow-Guided Video Inpainting" (CVPR2022)

loss explosion when training on custom Dataset #75

Open LokiXun opened 1 year ago

LokiXun commented 1 year ago

Hi, this is awesome work! May I ask for some help? I ran into a problem when training the model on the REDS video dataset: after about 40K iterations, the loss suddenly explodes and the predicted images become unrecognizable.
[loss curve screenshot] PS: the loss values shown in the picture are summed over the last 100 iterations.

To train on this dataset, I made the following modifications:

  1. Dataset: each clip has 100 frames at 1280x720 resolution. I crop them to 256x256 and add random blur. I use 7 local frames and 5 reference frames (sampled evenly from the whole video, excluding the local-frame region; see the sampling sketch after this list). My objective is deblurring, so I do not use a mask to cover the original image.
  2. To make training work, I modified the SoftSplit and transformer parameters: output_size = (64, 64) in this line and small window size = (11, 11), to match the [12, 22, 22, 512] feature produced by SoftSplit.
  3. I set no_dis: 1 in the config file to disable the adversarial loss (gan_loss), since I thought it might make training unstable.
  4. I only have a single 24 GB 4090 GPU, so I could only set batch_size = 1, and I did not change the scheduler, which means the learning rate stays at 1e-4 for the whole run.
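
For reference, here is a minimal sketch of the frame sampling described in item 1 (an illustrative re-implementation with made-up names, not the repo's actual dataset code):

```python
import random

def sample_frame_indices(video_len=100, num_local=7, num_ref=5):
    # Contiguous window of local frames at a random position in the clip.
    start = random.randint(0, video_len - num_local)
    local = list(range(start, start + num_local))
    # Reference frames: spread evenly over the frames outside the local window.
    remaining = [i for i in range(video_len) if i not in set(local)]
    step = len(remaining) / num_ref
    ref = [remaining[int(k * step)] for k in range(num_ref)]
    return local, ref

local_idx, ref_idx = sample_frame_indices()  # 7 local + 5 reference indices
```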

The prediction result at the loss-explosion iteration looks like grid_164_39300_030_00000000. PS: in the first row, the first 7 pictures are local frames and the last 5 are non-local (reference) frames; the second row is the corresponding GT; the third row is the model's prediction.

Did I mistakenly modify the parameters in TimeFocalTransformer? Have you had a similar issue, and how did you solve it? Thanks.
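
For what it's worth, a quick sanity check of the sizes in item 2, assuming SoftSplit uses the repo's default unfold parameters (kernel (7, 7), stride (3, 3), padding (3, 3)): an output_size of (64, 64) yields the 22x22 token grid seen in the [12, 22, 22, 512] feature, and an (11, 11) window tiles it evenly.

```python
def softsplit_token_grid(output_size, kernel=(7, 7), stride=(3, 3), padding=(3, 3)):
    # Spatial size of the token grid produced by nn.Unfold with these parameters
    # (assumed here to match the SoftSplit defaults in the repo).
    h = (output_size[0] + 2 * padding[0] - kernel[0]) // stride[0] + 1
    w = (output_size[1] + 2 * padding[1] - kernel[1]) // stride[1] + 1
    return h, w

grid = softsplit_token_grid((64, 64))      # -> (22, 22)
window = (11, 11)
assert grid[0] % window[0] == 0 and grid[1] % window[1] == 0  # window tiles the grid
```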

asfaukas commented 8 months ago

Dear @LokiXun, have you solved this problem? The loss increases at about 40K iterations.

stayhungry1 commented 7 months ago

Is the loss increase caused by the DCN layer during training?

Paper99 commented 7 months ago

Yes, the simple solution is to resume from a checkpoint saved before the crash.
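
A minimal sketch of that kind of resume (the checkpoint file name and keys below are illustrative, not the repo's exact save format):

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer; in practice these are the E2FGVI generator
# and its optimizer as constructed by the trainer.
model = nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Save periodically so a stable state is always available to roll back to.
torch.save({"iteration": 35000,
            "model": model.state_dict(),
            "optim": optimizer.state_dict()}, "ckpt_00035000.pth")

# Resume from the last checkpoint saved before the loss blew up.
ckpt = torch.load("ckpt_00035000.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optim"])
start_iteration = ckpt["iteration"]
```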
