avinashpaliwal / Super-SloMo

PyTorch implementation of Super SloMo by Jiang et al.
MIT License

Runtime Error (Traceback Attached) #106

Open Craig-Leko opened 2 years ago

Craig-Leko commented 2 years ago

Thank you very much for this project. I love it, and I've been using it successfully since early January. However, in the past week or so, it has stopped working on every Colab fork I've tried. Perhaps there has been an incompatible version update in a dependency somewhere?

Describe the bug: After loading the dependencies, I input the file path to the uploaded video I want to interpolate. On execution, I get the attached error, which is thrown by the process.communicate() call.

To Reproduce: The behavior happens every time. Colab is not running out of memory; I use Colab Pro+ and have taken the images down to 100x100 pixels and 50 total frames to ensure the file is small enough to rule out RAM as the issue.

  1. Visit Colab at: https://colab.research.google.com/github/MSFTserver/AI-Colab-Notebooks/blob/main/Super_SloMo.ipynb#scrollTo=Wz4BaariVdh5
  2. Run cells in "Download Super-Slomo Repo & Model"
  3. Run cells in "Run this block and Upload Video by clicking the Button that pops up below this codeblock! Wait till it loads the video and once it's done run the next block"
  4. Navigate in dialog box to file and upload file to server.
  5. No need to enter file path in this particular notebook as the upload itself conveys the file path.
  6. Run the main code.
  7. Error/Abnormal behavior

Expected behavior: Up until this week, when the error started, the Colab output two files: the .mkv file that the program encodes natively, and the .mp4 that results from the conversion.

Interpolated results/error output: (screenshot of the error attached)

Additional context: None, other than that I have been having the same issue in every notebook I am aware of that forks off the main repository, not just the one I've linked to here.

Thank you for your help, I really miss this utility!

Cheers

MSFTserver commented 2 years ago

Yeah, I'm unsure of this error and haven't had time to look into it on my end. I made the Colab you linked, lol; I just happened to check here to see if anyone else had issues or whether it was solved. I'm sure it has to be something along the lines of a dependency change, though that's hard to say; I need to run it locally and get the full ffmpeg error output, since it lists all of its build configuration flags before the actual error. And yeah, it seems it's been broken across the board for the last week or so.

Craig-Leko commented 2 years ago

Yes, I saw that Colab on your GitHub. I really appreciate you taking the time to look into it; I can tell by your contributions that you're very busy. This is a very important utility for a lot of users, whom I am representing with this official bug post, so your efforts in tracking down the issue and either deprecating or updating the dependencies are going to be very helpful to our community of animators. Thank you very much.

karanpanjabi commented 2 years ago

Full traceback on a sample file:

ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
  libavutil      55. 78.100 / 55. 78.100
  libavcodec     57.107.100 / 57.107.100
  libavformat    57. 83.100 / 57. 83.100
  libavdevice    57. 10.100 / 57. 10.100
  libavfilter     6.107.100 /  6.107.100
  libavresample   3.  7.  0 /  3.  7.  0
  libswscale      4.  8.100 /  4.  8.100
  libswresample   2.  9.100 /  2.  9.100
  libpostproc    54.  7.100 / 54.  7.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'upload/video.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
  Duration: 00:00:06.55, start: 0.000000, bitrate: 1846 kb/s
    Stream #0:0(und): Video: h264 (Baseline) (avc1 / 0x31637661), yuv420p(tv, unknown/bt470bg/unknown), 320x640, 1718 kb/s, 29.45 fps, 29.58 tbr, 90k tbn, 180k tbc (default)
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 131 kb/s (default)
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> png (native))
Press [q] to stop, [?] for help
Output #0, image2, to '.tmpSuperSloMo/input/%06d.png':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
    encoder         : Lavf57.83.100
    Stream #0:0(und): Video: png, rgb24, 320x640, q=2-31, 200 kb/s, 29.58 fps, 29.58 tbn, 29.58 tbc (default)
    Metadata:
      encoder         : Lavc57.107.100 png
frame=  192 fps= 51 q=-0.0 Lsize=N/A time=00:00:06.52 bitrate=N/A speed=1.73x    
video:56491kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Traceback (most recent call last):
  File "Super-SloMo/video_to_slomo.py", line 231, in <module>
    main()
  File "Super-SloMo/video_to_slomo.py", line 167, in main
    dict1 = torch.load(args.checkpoint, map_location='cpu')
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

I feel this could possibly be a version issue between PyTorch and the pickled model checkpoint. Maybe a new checkpoint could be created after training on the latest version of PyTorch (1.10.0+cu111 on Colab)?
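
For anyone else hitting this traceback, here is a minimal diagnostic sketch (the checkpoint path is a placeholder, not necessarily what the notebook uses) that helps tell a genuine PyTorch/pickle version mismatch apart from a bad download by printing the first bytes of the file when torch.load fails:

import torch

checkpoint_path = "SuperSloMo.ckpt"  # placeholder path

try:
    state = torch.load(checkpoint_path, map_location="cpu")
    print("Checkpoint loaded:", type(state))
except Exception as err:  # e.g. _pickle.UnpicklingError
    # A real checkpoint starts with pickle/zip magic bytes;
    # an HTML error page starts with '<'.
    with open(checkpoint_path, "rb") as f:
        head = f.read(64)
    print("torch.load failed:", err)
    print("First bytes of the file:", head)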

RichardSlater commented 2 years ago

I feel this could possibly be a version issue between PyTorch and the pickled model checkpoint. Maybe a new checkpoint could be created after training on the latest version of PyTorch (1.10.0+cu111 on Colab)?

Currently taking a look at this, as I have Colab Pro+:

import torch

print(torch.__version__)

1.10.0+cu111

RichardSlater commented 2 years ago

I'm assuming that this isn't right:

(screenshot: training plot showing the sudden PSNR drop)

Specifically, the peak signal-to-noise ratio suddenly drops from ~30 to 11.5 at around the 75th epoch. I'm going to let it finish and test out the resulting checkpoints.
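
For context on the metric: PSNR here is the usual peak signal-to-noise ratio over the validation frames, so a fall from ~30 dB to ~11.5 dB means the interpolated frames suddenly became far noisier. A minimal sketch of the standard computation (not the repo's exact code):

import torch

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB for images scaled to [0, max_val].
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

# Toy example: a frame versus a lightly noised copy of itself.
clean = torch.rand(3, 64, 64)
noisy = (clean + 0.05 * torch.randn_like(clean)).clamp(0, 1)
print(psnr(clean, noisy))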

RichardSlater commented 2 years ago

I've tried both of the checkpoint files, at epoch 200 and at epoch 70; however, all interpolated frames are grey. I'm going to trim the notebook down and make sure I've not missed something.

Craig-Leko commented 2 years ago

I've tried both of the checkpoint files, at epoch 200 and at epoch 70; however, all interpolated frames are grey. I'm going to trim the notebook down and make sure I've not missed something.

Thank you very much for working on this issue. This interpolator is a very valuable tool for many of us in the animation community. Cheers!

RichardSlater commented 2 years ago

Thank you very much for working on this issue. This interpolator is a very valuable tool for many of us in the animation community. Cheers!

No problem, I'm in the same boat as you using this for animations up until this stopped working a couple of weeks ago.

Currently it appears that some versions of ffmpeg are unable to process some of the data in create_dataset.py; it's possible that these failed conversions are the cause of this issue:

Application provided invalid, non monotonically increasing dts to muxer in stream 0: 457 >= 457

RichardSlater commented 2 years ago

I've made some changes to create_dataset.py which should hopefully result in a complete dataset; the original script was failing silently and leaving the dataset incomplete.

This may not be the root cause of the problem; however, I shall continue to plug away at it. I'm starting to gain a better understanding of the data structures within the pre-trained model.
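
As an aside, one way to make that kind of failure loud instead of silent is to check ffmpeg's exit code wherever the script shells out to it. This is only a sketch under the assumption that create_dataset.py invokes ffmpeg as a subprocess; the actual script may be structured differently:

import subprocess

def extract_frames(video_path, out_pattern):
    # Run ffmpeg and raise if it exits non-zero, so broken clips are not
    # silently skipped while building the dataset.
    cmd = ["ffmpeg", "-i", video_path, out_pattern]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"ffmpeg failed on {video_path}:\n{result.stderr[-2000:]}")

extract_frames("clip.mp4", "frames/%06d.png")  # placeholder paths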

RichardSlater commented 2 years ago

Right, so I'm making more progress with this; it's at epoch 106 and still going strong:

 Loss: 13.346040  Iterations:   99/1101  TrainExecTime: 44.5  ValLoss:12.191722  ValPSNR: 30.4959  ValEvalTime: 6.02 LearningRate: 0.000010
 Loss: 13.468689  Iterations:  199/1101  TrainExecTime: 43.7  ValLoss:12.110301  ValPSNR: 30.5903  ValEvalTime: 6.01 LearningRate: 0.000010
 Loss: 13.202195  Iterations:  299/1101  TrainExecTime: 43.6  ValLoss:12.183218  ValPSNR: 30.5098  ValEvalTime: 6.00 LearningRate: 0.000010
 Loss: 13.380725  Iterations:  399/1101  TrainExecTime: 43.7  ValLoss:12.171057  ValPSNR: 30.5121  ValEvalTime: 6.00 LearningRate: 0.000010
 Loss: 13.754861  Iterations:  499/1101  TrainExecTime: 43.8  ValLoss:12.147275  ValPSNR: 30.5573  ValEvalTime: 6.04 LearningRate: 0.000010
 Loss: 13.161016  Iterations:  599/1101  TrainExecTime: 43.7  ValLoss:12.098243  ValPSNR: 30.6064  ValEvalTime: 5.99 LearningRate: 0.000010
 Loss: 12.963065  Iterations:  699/1101  TrainExecTime: 43.7  ValLoss:12.126303  ValPSNR: 30.5667  ValEvalTime: 5.96 LearningRate: 0.000010
 Loss: 13.230197  Iterations:  799/1101  TrainExecTime: 43.8  ValLoss:12.100588  ValPSNR: 30.6165  ValEvalTime: 6.09 LearningRate: 0.000010
 Loss: 13.386173  Iterations:  899/1101  TrainExecTime: 43.8  ValLoss:12.122585  ValPSNR: 30.6012  ValEvalTime: 6.01 LearningRate: 0.000010
 Loss: 13.316765  Iterations:  999/1101  TrainExecTime: 43.8  ValLoss:12.122446  ValPSNR: 30.6058  ValEvalTime: 6.01 LearningRate: 0.000010

(training progress plot at epoch 106)

Plot(1, cLoss, 'red')
Plot(1, valLoss, 'blue')
Plot(2, valPSNR, 'green')

I suspect the data issue caused the GPU to crash. I expect this is going to take another 24 hours or so to run.

RichardSlater commented 2 years ago

The next thing I have run into is that once the 24 hours on Google Colab Pro+ is up and I restart the training using the TRAINING_CONTINUE flag, the learning rate resets from 0.000001 back to 0.000100, which substantially decreases the PSNR and increases the loss:

(screenshot: validation loss and PSNR after the learning-rate reset)

I'm going to look into torch.optim.lr_scheduler.MultiStepLR(), as this feels like a bug; failing that, I would have to spin up a VM in Google Cloud (trivial with Terraform, but relatively expensive) for a couple of days to let this run unimpeded. Ideally, however, I would like to leave this in a state where anyone with a Colab Pro+ subscription, from the animation community or otherwise, can spin this up without having to know GCP/GCE.
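
For reference, that reset is consistent with the optimizer and learning-rate scheduler state not being restored when training resumes. Below is a self-contained sketch of carrying torch.optim.lr_scheduler.MultiStepLR across a checkpoint; the toy model, milestones, and file name are placeholders rather than the repo's actual training code:

import torch
import torch.nn as nn

# Toy model standing in for the Super SloMo networks.
model = nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

# Saving: include optimizer and scheduler state alongside the weights.
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, "resume_demo.ckpt")

# Resuming: restore all three so the learning-rate schedule picks up where
# it left off instead of resetting to the initial rate.
ckpt = torch.load("resume_demo.ckpt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])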

Latest (untested) checkpoint is here: https://www.dropbox.com/s/bt0z5lah0gtp82q/SuperSloMo-Python37%2B1.10.0%2Bcu111-epoch150.ckpt?dl=0

RichardSlater commented 2 years ago

@MSFTserver @Craig-Leko @karanpanjabi - Initial testing seems to suggest that, despite my issues above getting all 200 iterations completed in Colab, the following checkpoint works with the original author's YAML configs in my heavily edited version of A.I Whisperer's Intermediate Animation VQGAN+CLIP notebook:

Can I ask that you don't share the link widely, as Dropbox will only allow ~130 downloads of a single link in a day; I'll have a think about hosting this in a more sensible way.

MSFTserver commented 2 years ago

Awesome, I shall test tomorrow, and I can re-upload it to my Dropbox. I have Pro, which I believe allows unlimited downloads and 400 GB of bandwidth a day. Why they have both caps I'm not sure, as it seems the limit is 400 GB anyway.

Craig-Leko commented 2 years ago

It is working for me. Thanks again.

RichardSlater commented 2 years ago

Awesome, I shall test tomorrow, and I can re-upload it to my Dropbox. I have Pro, which I believe allows unlimited downloads and 400 GB of bandwidth a day. Why they have both caps I'm not sure, as it seems the limit is 400 GB anyway.

That would be fantastic; I only have Dropbox Plus. 400 GB is roughly 2,600 downloads a day, which I think should be enough. I'll leave the above links available as backups. If bandwidth becomes a problem in the future, we can look at other options.

It is working for me. Thanks again.

Fantastic to hear! Are you posting what you are doing publicly? If so, drop me a link; I'd be interested to see what you are doing.

karanpanjabi commented 2 years ago

This was amazing! Thanks a lot, @RichardSlater, for fixing this and creating the new checkpoint file. I think we could also put the checkpoint file in the repository itself through Git LFS for the long term, but maybe that can be brought up later in another issue.

MSFTserver commented 2 years ago

@karanpanjabi Git LFS is better for private repositories; I found this out the hard way, as anyone who forks the repository and then pushes changes to their fork also counts against the original owner's LFS usage.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage#tracking-storage-and-bandwidth-use

MSFTserver commented 2 years ago

feel free to use this URL in projects https://www.dropbox.com/s/f2f5pi76z6aaehe/SuperSloMo-Python37%2B1.10.0%2Bcu111-epoch150.ckpt

karanpanjabi commented 2 years ago

I found another issue, which seems trivial now. Maybe retraining was not required 🙈. Sorry for not catching this earlier. In the original Colab notebook, the function download_from_google_drive is not actually able to fetch the proper checkpoint file from Google Drive. The file fetched is just a few KB, while the actual ckpt is ~150 MB. It's downloading an HTML page due to a logical error in forming the URL:

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 400 (Bad Request)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>400.</b> <ins>That’s an error.</ins>
  <p>Your client has issued a malformed or illegal request.  <ins>That’s all we know.</ins>

The first '<' is what we see in the UnpicklingError above.
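
One possible workaround (not what the notebook currently does) is to fetch the checkpoint with gdown, which handles Google Drive's confirmation pages, and to sanity-check the result before loading it. The file ID below is a placeholder, and the ~150 MB figure is simply the expected size mentioned above:

import os
import gdown

url = "https://drive.google.com/uc?id=FILE_ID"  # placeholder Drive file ID
output = "SuperSloMo.ckpt"

gdown.download(url, output, quiet=False)

# A genuine checkpoint is ~150 MB; an HTML error page is a few KB and starts
# with '<', which is exactly the "invalid load key, '<'" seen above.
size_mb = os.path.getsize(output) / (1024 * 1024)
with open(output, "rb") as f:
    looks_like_html = f.read(1) == b"<"
if size_mb < 1 or looks_like_html:
    raise RuntimeError("Download failed: got an error page instead of the checkpoint")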

carriemorrison commented 2 years ago

@karanpanjabi you can upload the new ckpt file to your own Google Drive and use that instead:


from google.colab import drive

google_drive = True #@param {type:"boolean"}

# Mount your Google Drive so the checkpoint path below is visible to the notebook.
if google_drive:
    drive.mount('/content/gdrive')

pretrained_model = '/content/gdrive/MyDrive/SuperSloMo-Python37+1.10.0+cu111-epoch150.ckpt'

Craig-Leko commented 2 years ago

It is working for me. Thanks again.

Fantastic to hear! Are you posting what you are doing publicly? If so, drop me a link; I'd be interested to see what you are doing.

Yes -> https://instagram.com/zeebohm

MSFTserver commented 2 years ago

@karanpanjabi weird, because it was working; I wonder why the curl commands all of a sudden started to fail for that. Google Drive does have much lower thresholds for downloads, so maybe we maxed it out with API requests versus downloading manually through the browser.

In any event, my notebooks have been updated and I have tested them.

RichardSlater commented 2 years ago

I found another issue, which seems trivial now. Maybe retraining was not required 🙈.

😆 If nothing else we have gained two things:

  1. I have learned a lot about the process, and a better conceptual model will help if we ever need to revisit this.
  2. We have validated that we are able to re-create the neural network 4 years down the line.

Credit to @avinashpaliwal for putting this out there in the first place, and of course to the paper's authors for the immense effort they put into architecting the neural network.

Yes -> https://instagram.com/zeebohm

That's awesome, I'm mainly on TikTok.