Craig-Leko opened this issue 2 years ago
Yeah, I'm unsure of this error and haven't had time to look into it on my end. I made the colab you linked, lol, and just happened to check here to see if anyone else had issues or if it had been solved. I'm sure it has to be something along the lines of a dependency change, though that's hard to say; I need to run it locally and get the full ffmpeg error output, as it lists all of its build configuration flags before the actual error. And yeah, it seems it's been broken across the board for the last week or so.
Yes, I saw that colab on your GitHub. I really appreciate you taking the time to look into it; I can tell by your contributions that you're very busy. This is a very important utility for a lot of users, whom I'm representing with this official bug post, so your efforts in tracking down the issue and either deprecating or updating the dependencies are going to be very helpful to our community of animators. Thank you very much.
Full traceback on a sample file:
ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
libavutil 55. 78.100 / 55. 78.100
libavcodec 57.107.100 / 57.107.100
libavformat 57. 83.100 / 57. 83.100
libavdevice 57. 10.100 / 57. 10.100
libavfilter 6.107.100 / 6.107.100
libavresample 3. 7. 0 / 3. 7. 0
libswscale 4. 8.100 / 4. 8.100
libswresample 2. 9.100 / 2. 9.100
libpostproc 54. 7.100 / 54. 7.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'upload/video.mp4':
Metadata:
major_brand : mp42
minor_version : 0
compatible_brands: mp42isom
Duration: 00:00:06.55, start: 0.000000, bitrate: 1846 kb/s
Stream #0:0(und): Video: h264 (Baseline) (avc1 / 0x31637661), yuv420p(tv, unknown/bt470bg/unknown), 320x640, 1718 kb/s, 29.45 fps, 29.58 tbr, 90k tbn, 180k tbc (default)
Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 131 kb/s (default)
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> png (native))
Press [q] to stop, [?] for help
Output #0, image2, to '.tmpSuperSloMo/input/%06d.png':
Metadata:
major_brand : mp42
minor_version : 0
compatible_brands: mp42isom
encoder : Lavf57.83.100
Stream #0:0(und): Video: png, rgb24, 320x640, q=2-31, 200 kb/s, 29.58 fps, 29.58 tbn, 29.58 tbc (default)
Metadata:
encoder : Lavc57.107.100 png
frame= 192 fps= 51 q=-0.0 Lsize=N/A time=00:00:06.52 bitrate=N/A speed=1.73x
video:56491kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Traceback (most recent call last):
File "Super-SloMo/video_to_slomo.py", line 231, in <module>
main()
File "Super-SloMo/video_to_slomo.py", line 167, in main
dict1 = torch.load(args.checkpoint, map_location='cpu')
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 777, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
I feel this could possibly be a version issue between PyTorch and the pickled checkpoint of the model. Maybe a new checkpoint could be created after training on the latest version of PyTorch (1.10.0+cu111 on Colab)?
> I feel this could possibly be a version issue between PyTorch and the pickled checkpoint of the model. Maybe a new checkpoint could be created after training on the latest version of PyTorch (1.10.0+cu111 on Colab)?
Currently taking a look at this as I have CoLab Pro+:
import torch
print(torch.__version__)
1.10.0+cu111
I'm assuming that this isn't right:
Specifically, the Peak Signal-to-Noise Ratio drops suddenly from ~30 to 11.5 at around the 75th epoch. I'm going to let it finish and test out the resulting checkpoints.
I've tried both of the checkpoint files, at epoch 200 and epoch 70; however, all interpolated frames are grey. I'm going to trim the notebook down and make sure I've not missed something.
Thank you very much for working on this issue. This interpolator is a very valuable tool for many of us in the animation community. Cheers!
> Thank you very much for working on this issue. This interpolator is a very valuable tool for many of us in the animation community. Cheers!
No problem, I'm in the same boat as you, using this for animations up until it stopped working a couple of weeks ago.
Currently it appears some versions of ffmpeg are unable to process some of the data in create_dataset.py; it's possible that these failed conversions are the cause of this issue:
Application provided invalid, non monotonically increasing dts to muxer in stream 0: 457 >= 457
I've made some changes to create_dataset.py which should hopefully result in a complete dataset; the original script was failing silently, so the dataset it produced was incomplete.
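For anyone following along, here is a minimal sketch of the kind of guard I mean, assuming the dataset script shells out to ffmpeg via subprocess; the extract_frames name and the exact ffmpeg arguments are illustrative rather than the repo's actual code:

```python
import subprocess

def extract_frames(video_path, out_dir):
    """Run ffmpeg and fail loudly instead of silently (illustrative sketch)."""
    cmd = ["ffmpeg", "-i", video_path, f"{out_dir}/%06d.jpg"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface ffmpeg's stderr (e.g. the non-monotonic DTS message above)
        # rather than carrying on and building a partial dataset.
        raise RuntimeError(f"ffmpeg failed for {video_path}:\n{result.stderr}")
```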
This may not be the root cause of the problem; however, I shall continue to plug away at it. I'm starting to gain a better understanding of the data structures within the pre-trained model.
Right, so I'm making more progress on this; it's at epoch 106 and still going strong:
Loss: 13.346040 Iterations: 99/1101 TrainExecTime: 44.5 ValLoss:12.191722 ValPSNR: 30.4959 ValEvalTime: 6.02 LearningRate: 0.000010
Loss: 13.468689 Iterations: 199/1101 TrainExecTime: 43.7 ValLoss:12.110301 ValPSNR: 30.5903 ValEvalTime: 6.01 LearningRate: 0.000010
Loss: 13.202195 Iterations: 299/1101 TrainExecTime: 43.6 ValLoss:12.183218 ValPSNR: 30.5098 ValEvalTime: 6.00 LearningRate: 0.000010
Loss: 13.380725 Iterations: 399/1101 TrainExecTime: 43.7 ValLoss:12.171057 ValPSNR: 30.5121 ValEvalTime: 6.00 LearningRate: 0.000010
Loss: 13.754861 Iterations: 499/1101 TrainExecTime: 43.8 ValLoss:12.147275 ValPSNR: 30.5573 ValEvalTime: 6.04 LearningRate: 0.000010
Loss: 13.161016 Iterations: 599/1101 TrainExecTime: 43.7 ValLoss:12.098243 ValPSNR: 30.6064 ValEvalTime: 5.99 LearningRate: 0.000010
Loss: 12.963065 Iterations: 699/1101 TrainExecTime: 43.7 ValLoss:12.126303 ValPSNR: 30.5667 ValEvalTime: 5.96 LearningRate: 0.000010
Loss: 13.230197 Iterations: 799/1101 TrainExecTime: 43.8 ValLoss:12.100588 ValPSNR: 30.6165 ValEvalTime: 6.09 LearningRate: 0.000010
Loss: 13.386173 Iterations: 899/1101 TrainExecTime: 43.8 ValLoss:12.122585 ValPSNR: 30.6012 ValEvalTime: 6.01 LearningRate: 0.000010
Loss: 13.316765 Iterations: 999/1101 TrainExecTime: 43.8 ValLoss:12.122446 ValPSNR: 30.6058 ValEvalTime: 6.01 LearningRate: 0.000010
[Training plots: cLoss in red and valLoss in blue on plot 1; valPSNR in green on plot 2]
I suspect the data issue caused the GPU to crash; this is going to take another 24 hours or so to run.
The next thing I have run into is the fact that once the 24 hours on Google Colab Pro+ is up and I restart the training using the TRAINING_CONTINUE flag, the learning rate resets to 0.000100 from 0.000001, which substantially decreases the PSNR and increases the loss.
I'm going to look into torch.optim.lr_scheduler.MultiStepLR() as this feels like a bug; failing that, I would have to spin up a VM in Google Cloud (trivial with Terraform, but relatively expensive) for a couple of days to let this run unimpeded. Ideally, however, I would like to leave this in a state where anyone with a Colab Pro+ subscription from the animation community (or elsewhere) can spin this up without having to know GCP/GCE.
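For what it's worth, here is a minimal sketch of how a resume could keep the schedule intact, assuming the training loop uses torch.optim.lr_scheduler.MultiStepLR; the checkpoint key names and milestones below are illustrative, not necessarily what train.py actually uses:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in module; in train.py this would be the flow-computation/interpolation nets.
model = nn.Linear(8, 8)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
epoch = 106  # e.g. where the 24-hour Colab session ended

# Saving a resumable checkpoint.
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),  # remembers which milestones have already passed
    "epoch": epoch,
}, "resume.ckpt")

# Resuming (what TRAINING_CONTINUE would ideally do).
ckpt = torch.load("resume.ckpt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])  # brings back the decayed learning rate
scheduler.load_state_dict(ckpt["scheduler"])  # so it does not reset to the initial 0.000100
start_epoch = ckpt["epoch"] + 1
```

MultiStepLR also takes a last_epoch argument, but restoring the saved optimizer and scheduler state seems the less fiddly route.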
Latest (untested) checkpoint is here: https://www.dropbox.com/s/bt0z5lah0gtp82q/SuperSloMo-Python37%2B1.10.0%2Bcu111-epoch150.ckpt?dl=0
@MSFTserver @Craig-Leko @karanpanjabi - Initial testing seems to suggest that, despite my issues above getting all 200 epochs completed in Colab, the following checkpoint works with the original author's YAML configs in my heavily edited version of A.I Whisperer's Intermediate Animation VQGAN+CLIP notebook:
Can I ask that you don't share the link widely, as Dropbox will only allow ~130 downloads of a single link in a day; I'll have a think about hosting this in a more sensible way.
Awesome, I shall test tomorrow and I can reupload it to my Dropbox; I have Pro, which I believe allows unlimited downloads and 400 GB of bandwidth a day. Why they have both caps I'm not sure, as it seems the limit is 400 GB anyway.
It is working for me. Thanks again.
> Awesome, I shall test tomorrow and I can reupload it to my Dropbox; I have Pro, which I believe allows unlimited downloads and 400 GB of bandwidth a day.
That would be fantastic. I only have Dropbox Plus. 400GB is 2,600 downloads a day which I think should be enough. I'll leave the above links available as backups. If bandwidth becomes a problem in the future we can look at other options.
> It is working for me. Thanks again.
Fantastic to hear! Are you posting what you are doing publicly? If so, drop me a link; I'd be interested to see what you are doing.
This was amazing! Thanks a lot @RichardSlater for fixing this and creating the new checkpoint file. I think we could also put the checkpoint file on Git itself through git-lfs for the long term, but maybe that can be brought up later in another issue ticket.
@karanpanjabi git-lfs is better suited to private repositories; I found this out the hard way, as anyone who forks, or proceeds to make changes to their fork, also counts against the original owner's repository LFS usage.
Feel free to use this URL in projects: https://www.dropbox.com/s/f2f5pi76z6aaehe/SuperSloMo-Python37%2B1.10.0%2Bcu111-epoch150.ckpt
I found another issue which seems trivial now. Maybe a retraining was not required 🙈. Sorry for not catching this earlier. In the original colab notebook, the function download_from_google_drive is not actually able to fetch the proper checkpoint file from Google Drive. The file fetched is just a few KB, while the actual ckpt is ~150 MB: it's downloading an HTML page due to a logical error in forming the URL.
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 400 (Bad Request)!!1</title>
<p><b>400.</b> <ins>That’s an error.</ins>
<p>Your client has issued a malformed or illegal request. <ins>That’s all we know.</ins>
The first '<' is what we see in the UnpicklingError above.
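For anyone patching their own copy of the notebook, here is a minimal sketch of a more robust download, assuming the gdown package is available (the file ID below is a placeholder, not the real one), plus a cheap check that catches the HTML case before torch.load ever sees it:

```python
import gdown
import torch

# Placeholder ID: substitute the checkpoint's real Google Drive file ID.
url = "https://drive.google.com/uc?id=FILE_ID_GOES_HERE"
ckpt_path = "SuperSloMo.ckpt"

# gdown follows Drive's confirmation page for large files instead of saving the HTML.
gdown.download(url, ckpt_path, quiet=False)

# A real checkpoint is ~150 MB of pickled tensors, not a page starting with '<'
# (which is exactly what produced the UnpicklingError above).
with open(ckpt_path, "rb") as f:
    if f.read(1) == b"<":
        raise RuntimeError("Got an HTML page instead of the checkpoint; check the URL/file ID.")

state = torch.load(ckpt_path, map_location="cpu")
```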
@karanpanjabi you can upload the new ckpt file to your own Drive and use that instead:
from google.colab import drive
google_drive = True #@param {type:"boolean"}
if google_drive:
    drive.mount('/content/gdrive')  # mount Drive so the checkpoint path below is reachable
pretrained_model = '/content/gdrive/MyDrive/SuperSloMo-Python37+1.10.0+cu111-epoch150.ckpt'
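And then, assuming the notebook invokes the conversion script directly, something like the cell below; the --checkpoint flag name is inferred from args.checkpoint in the traceback above, and any other required arguments are omitted here:

```python
# Hypothetical Colab cell; other flags the script needs are left out.
!python Super-SloMo/video_to_slomo.py --checkpoint "{pretrained_model}"
```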
@karanpanjabi Weird, because it was working; I wonder why the curl commands all of a sudden started to fail for that. Google Drive does have way lower thresholds for downloads, so maybe we tapped it out for API requests vs going manually through the browser.
In any event my notebooks have been updated and I have tested it.
> I found another issue which seems trivial now. Maybe a retraining was not required 🙈.
😆 If nothing else we have gained two things:
Credit to @avinashpaliwal for putting this out there in the first place, and of course to the paper's authors for the immense effort they put into architecting the neural network.
That's awesome, I'm mainly on TikTok.
Thank you very much for this project. I love it, and I've been using it successfully since early January. However, in the past week or so, it has stopped working on every colab fork I've tried. Perhaps there has been an incompatible version update in a dependency somewhere?
Describe the bug: After loading the dependencies, I input the file path to the uploaded video I want to interpolate. On execution, I get the attached error, which is a standard error being thrown by the process.communicate() method.
To Reproduce: The behavior happens every time. Colab is not running out of memory, as I use Colab Pro+ and I have taken the images down to 100x100 pixels and 50 total frames to ensure the file is sufficiently small to rule out running out of RAM as the issue.
Expected behavior: Up until this week when the error started, the colab output two files: the first being the .mkv file that the program encodes natively, and then the .mp4 that results from conversion.
Interpolated results/error output
Additional context: No additional context, other than that I have been having the same issue in each notebook I am aware of that forks off the main repository, not just the one I've linked to here.
Thank you for your help, I really miss this utility!
Cheers