xRoyBx opened 3 years ago
I think you didn't build the packages correctly; make sure you have PyTorch >=1.0.0, <=1.4.0. If you're just interested in doing the interpolation, check my repo iBobbyTS/VFIN. It's very easy to use, DAIN is included, there are Colab notebooks that I tested, and there's also a tar file with everything (Python, PyTorch and DAIN packages) installed correctly. Open an issue there if there is any problem with it.
I don't know how to install/change PyTorch >=1.0.0, <=1.4.0 in Colab (Win10). Anyway, using the "official" notebook by Styler00Dollar and Alpha or other related notebooks, I get the same error. I'll try VFIN, thanks ;)
To install pytorch 1.4, simply run this:
pip install torch==1.4.0
If you're using a notebook like Colab, add a ! before it, like
!pip install torch==1.4.0
I'm modifying the code in VFIN very often now, so there might be errors when someone else uses it. I'm still learning about GitHub; I might start using the branch and release systems to keep stable versions and somewhere else to develop.
This error has popped up a few times in the issues section of various DAIN colab repos. I have read it's possibly something with the GPU build. I would love for someone to look into this because I am not familiar at all with this area of coding and I've only gotten DAIN to work once (probably when I was assigned a P100). I get this same error when using different colabs for DAIN, specifically when I get a V100 I think.
Has anyone found a solution???
Hi! Yes, I know what the error is. I don’t know exactly how to fix it.
This is the relevant error line:
error in correlation_forward_cuda_kernel: no kernel image is available for execution on the device
This means the native CUDA module was not compiled for the GPU it is actually running on. As Google keeps adding new GPU models to Colab, we will keep hitting these issues. This explains the symptoms that @TaoTeCha is seeing.
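To make the mismatch concrete, here is a hedged sketch: the table maps GPUs commonly handed out by Colab to their NVIDIA-published compute capabilities, and the helper builds the nvcc flags a matching kernel image needs. `COLAB_GPU_SM` and `gencode_flags` are illustrative names of my own, not part of DAIN.

```python
# Illustrative mapping from Colab GPU names to CUDA compute
# capabilities (NVIDIA's published values, not DAIN code).
COLAB_GPU_SM = {
    "Tesla K80": "37",
    "Tesla P4": "61",
    "Tesla P100": "60",
    "Tesla T4": "75",
    "Tesla V100": "70",
}

def gencode_flags(gpu_name):
    """Return the nvcc -gencode flags that build a kernel image for this GPU."""
    sm = COLAB_GPU_SM[gpu_name]
    return ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]

# If the module was built with only sm_60 flags (P100) and Colab hands
# out a V100 (sm_70), "no kernel image is available" is the result.
print(gencode_flags("Tesla V100"))
```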
My first attempt at this was achieved in #87, where I manually added a bunch of models for all the GPUs I could find in Colab at that time (June 2020). I also added a structure that hopefully made it easier to add more over time... but it's less than ideal.
@xRoyBx Note that the version of the Colab with those fixes also suppresses a few warnings that I saw in your logs. Is it possible you’re not using the latest one? 1.5 is already in master, 1.5.1 is in a PR (#116).
If anyone knows how to achieve future general compatibility, I’d be glad to work on that.
I've trained and run a handful of deep learning models in Colab, but in every case the GPU has been all set up and ready to go, so I am totally ignorant about all this. I have a couple of questions.
1) Why do you need to do a 15-minute 'build' with DAIN when I have never had to do this with any other model I've used?
2) What parameters do I need to change in the files to find a model that works with Colab's V100? I'm willing to put in the trial-and-error work if someone enlightens me on what I should be changing.
Thanks
DAIN is a mixture of different CNNs put together, some of them from previous papers. You can find more info here and in the original paper. So that you don't have to run 6 CNNs in parallel, which is memory-expensive and incredibly slow, the authors compiled some of these "layers" into CUDA modules they could run directly on the GPU to train and infer with DAIN. These modules are the ones taking ~15 minutes to build and giving us these headaches.
Check out this file.
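For context on what that 15-minute build is doing: the correlation package is compiled as a PyTorch CUDA extension. A minimal sketch of such a build script follows; the file names and the single -gencode pair are illustrative only (DAIN's actual setup files list many architectures), so treat this as the shape of the mechanism, not the exact contents.

```python
# Sketch of a PyTorch CUDA extension build script, in the style of
# DAIN's correlation package. The -gencode pair targets sm_70 (V100)
# only; a real build lists one pair per GPU it should support.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="correlation_cuda",
    ext_modules=[
        CUDAExtension(
            name="correlation_cuda",
            sources=["correlation_cuda.cc", "correlation_cuda_kernel.cu"],
            extra_compile_args={
                "cxx": [],
                "nvcc": ["-gencode", "arch=compute_70,code=sm_70"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

nvcc compiles the .cu kernel once per listed architecture, which is why each extra GPU model both lengthens the build and is required to avoid the "no kernel image" error on that GPU.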
Not sure if it's a coincidence, but I uncommented '-gencode', 'arch=compute_70,code=sm_70' in compiler_args.py and switched to !pip install torch==1.0.0 torchvision==0.2.1
The colab is working with a V100 now. I'll probably use this a handful of times over the next week and I'll keep you updated if it continues to work.
Thanks!
Thanks for the interpolation fix. Unfortunately I get this error when creating the output video; apparently it doesn't create the output frames:
CalledProcessError Traceback (most recent call last)
Are you sure your output path exists? Do you have a folder in your drive named DAIN? When you mounted you drive, did you mount it as gdrive or just drive? Try changing the output to '/content/output.mp4' and just download from the colab file folder.
Or try !ffmpeg instead of %shell ffmpeg
Paths are OK, the DAIN folder is present, drive mounted as gdrive. The problem is always the same (forget my previous post, sorry): it doesn't create the output PNG frames even though the output folder is present (using a Tesla V100 in Colab).
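Before pointing at ffmpeg, it may help to confirm whether DAIN wrote anything at all. A small hedged check follows; the directory path is the `frame_output_dir` from the notebook's log output, and `count_frames` is my own helper, not part of DAIN.

```python
# Hedged helper: count the PNG frames DAIN wrote, so an empty output
# directory is caught before ffmpeg fails on it.
from pathlib import Path

def count_frames(output_dir):
    """Return how many .png frames exist in output_dir (0 if missing)."""
    d = Path(output_dir)
    if not d.is_dir():
        return 0
    return len(list(d.glob("*.png")))

n = count_frames("/content/DAIN/output_frames")
if n == 0:
    print("No frames written: interpolation failed before encoding.")
else:
    print(f"{n} frames ready for ffmpeg.")
```

If this prints zero, the ffmpeg step is a red herring and the failure is upstream, in the interpolation itself.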
In VFIN you don't need to worry about that. Just specify -ot video and it will generate an mp4 in your input folder; if you specify the output with -o, you can use any extension and save it anywhere.
using this command:
!/content/python/bin/python3 /content/VFIN/run.py -i "/content/drive/My Drive/VFIN/input.mp4" -o "/content/drive/My Drive/VFIN/output.mp4"
i get this error
/content/python/bin/python3: can't open file '/content/VFIN/run.py': [Errno 2] No such file or directory
Sorry, I changed the name of the running file; for now, use
!/content/python/bin/python3 /content/VFIN/run_class.py -i "/content/drive/My Drive/VFIN/input.mp4" -o "/content/drive/My Drive/VFIN/output.mp4"
instead.
I fixed the GitHub repo and the pre-built tar file; copy the tar file to your drive again and use it next time, and copy the notebook too, since I edited it.
By the way, you need -a DAIN -ot video to make it use DAIN and output a video. For any problem with VFIN, please open an issue there.
Hello, I always get this error during interpolation and can't proceed:
/content/DAIN
revise the unique id to a random numer 68776
Namespace(SAVED_MODEL=None, alpha=[0.0, 1.0], arg='./model_weights/68776-Tue-Nov-10-11-46/args.txt', batch_size=1, channels=3, ctx_lr_coe=1.0, datasetName='Vimeo_90K_interp', datasetPath='', dataset_split=97, debug=False, depth_lr_coe=0.001, dtype=<class 'torch.cuda.FloatTensor'>, end_frame=1259, epsilon=1e-06, factor=0.2, filter_lr_coe=1.0, filter_size=4, flow_lr_coe=0.01, force=False, frame_input_dir='/content/DAIN/input_frames', frame_output_dir='/content/DAIN/output_frames', log='./model_weights/68776-Tue-Nov-10-11-46/log.txt', lr=0.002, netName='DAIN_slowmotion', no_date=False, numEpoch=100, occ_lr_coe=1.0, patience=5, rectify_lr=0.001, save_path='./model_weights/68776-Tue-Nov-10-11-46', save_which=1, seed=1, start_frame=1, time_step=0.5, uid=None, use_cuda=True, use_cudnn=1, weight_decay=0, workers=8)
cudnn is used
Interpolate 1 frames
error in correlation_forward_cuda_kernel: no kernel image is available for execution on the device
Warning: Legacy autograd function with non-static forward method is deprecated and will be removed in 1.3. Please use new-style autograd function with static forward method. (Example: https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function) (THPFunction_do_forward at /pytorch/torch/csrc/autograd/python_function.cpp:622)
Traceback (most recent call last):
File "colab_interpolate.py", line 112, in
y_s, offset, filter = model(torch.stack((X0, X1),dim = 0))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/DAIN/networks/DAIN_slowmotion.py", line 148, in forward
self.forward_flownets(self.flownets, cur_offset_input, time_offsets=time_offsets),
File "/content/DAIN/networks/DAIN_slowmotion.py", line 212, in forward_flownets
temp = model(input) # this is a single direction motion results, but not a bidirectional one
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/DAIN/PWCNet/PWCNet.py", line 221, in forward
corr6 = self.corr(c16, c26)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/content/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 59, in forward
result = CorrelationFunction(self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)(input1, input2)
File "/content/DAIN/PWCNet/correlation_package_pytorch1_0/correlation.py", line 27, in forward
self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)
RuntimeError: CUDA call failed (correlation_forward_cuda at correlation_cuda.cc:80)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3c81ae3193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: correlation_forward_cuda(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, int, int, int, int, int, int) + 0x628 (0x7f3c7e117b38 in /usr/local/lib/python3.6/dist-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: + 0x1bd4a (0x7f3c7e127d4a in /usr/local/lib/python3.6/dist-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: + 0x18890 (0x7f3c7e124890 in /usr/local/lib/python3.6/dist-packages/correlation_cuda-0.0.0-py3.6-linux-x86_64.egg/correlation_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: python3() [0x50a4a5]