NVlabs / few-shot-vid2vid

Pytorch implementation for few-shot photorealistic video-to-video translation.

Pitfalls and solutions when training models on Colab #8

Closed jjandnn closed 4 years ago

jjandnn commented 4 years ago


English (machine translation, forgive me):

1. CUDA errors, "no kernel image" and "no memory": Colab mainly hands out four kinds of GPU: K80, P4, P100 and T4. On the P100 and T4, installing and running flownet2_pytorch (which has to be reinstalled in every session) works without problems. Either follow the official README, or go into few-shot-vid2vid/models/networks/flownet2_pytorch/ and run !bash install.sh. The K80 and P4, however, fail with "cuda kernel: no kernel image ..." (a long-standing CUDA problem that never seems to get solved).

The best solution: reset the runtime to get a different host, and repeat until you land on a T4 or P100. It is the least effort and the most efficient, haha.

The second-best solution is to modify these three files:

/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/channelnorm_package/setup.py
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/correlation_package/setup.py
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/resample2d_package/setup.py

In each of the three, add the matching architectures to nvcc_args:

    nvcc_args = [
        '-gencode', 'arch=compute_30,code=sm_30',
        '-gencode', 'arch=compute_35,code=sm_35',
        '-gencode', 'arch=compute_37,code=sm_37',
        '-gencode', 'arch=compute_50,code=sm_50',
        '-gencode', 'arch=compute_52,code=sm_52',
        '-gencode', 'arch=compute_60,code=sm_60',
        '-gencode', 'arch=compute_61,code=sm_61',
        '-gencode', 'arch=compute_70,code=sm_70',
        '-gencode', 'arch=compute_70,code=compute_70'
    ]

For the K80, please specify pytorch==0.4.1.
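The flag list in the second solution follows a regular pattern, so it can be generated instead of being typed into three files by hand. The following is only a sketch: the helper name gencode_flags is hypothetical and not part of the repo, and the architecture list is simply the one from the post. For reference, the K80 is compute capability 3.7, the P4 is 6.1 and the P100 is 6.0. The final code=compute_70 entry embeds PTX rather than a compiled kernel, which the driver can JIT-compile for newer GPUs.

```python
def gencode_flags(archs, ptx_arch=None):
    """Build nvcc -gencode flags for a list of compute capabilities.

    archs    -- capabilities as strings without the dot, e.g. '37' for 3.7
    ptx_arch -- optional capability for which PTX is embedded as well
    (Hypothetical helper for illustration; not part of few-shot-vid2vid.)
    """
    flags = []
    for arch in archs:
        # One compiled kernel image (SASS) per listed architecture.
        flags += ['-gencode', 'arch=compute_{0},code=sm_{0}'.format(arch)]
    if ptx_arch is not None:
        # Embed PTX so that newer GPUs can JIT-compile the kernels.
        flags += ['-gencode',
                  'arch=compute_{0},code=compute_{0}'.format(ptx_arch)]
    return flags


# Same list as in the post above.
nvcc_args = gencode_flags(
    ['30', '35', '37', '50', '52', '60', '61', '70'], ptx_arch='70')

for i in range(0, len(nvcc_args), 2):
    print(nvcc_args[i], nvcc_args[i + 1])
```

If the installed nvcc rejects one of the older architectures as unsupported, removing that entry from the list is the usual way out; which entries are accepted depends on the CUDA toolkit version in the session.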

2. Web preview image read error when resuming training: "input/output: epoch ...". This error does not occur locally, only with Colab plus Google Drive. Cause: there are too many files in the Google Drive folder, so Colab cannot read it (also an old problem). Solution: go to /content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/, delete the whole images folder, and create a new, empty images folder. Alternatively, add the --no_html option when training (I haven't tested this, because I need the preview).

Note: when you resume training, iters may not be a whole number; this does not affect the result.
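The delete-and-recreate step for the images folder can be done from a notebook cell. This is a minimal sketch using only the standard library; the function name reset_images_dir is hypothetical, and the path in the usage comment is the one from the post, so adjust it if your checkpoint name is not face. Deleting a very large folder through the Drive mount may be slow, and may hit the same I/O problem, in which case removing it from the Drive web interface is an alternative.

```python
import os
import shutil


def reset_images_dir(web_dir):
    """Delete <web_dir>/images and recreate it empty.

    Hypothetical helper for illustration; returns the path of the new folder.
    """
    images_dir = os.path.join(web_dir, "images")
    # ignore_errors=True: a missing folder or a file that fails to delete
    # does not abort the cleanup.
    shutil.rmtree(images_dir, ignore_errors=True)
    # exist_ok=True: fine if the folder survived a partial delete.
    os.makedirs(images_dir, exist_ok=True)
    return images_dir


# Usage on Colab (path taken from the post above):
# reset_images_dir(
#     "/content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web")
```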

3. After 50 epochs ("seq length to XX"), training runs out of memory. Solution: switch to a P100; the T4 does not work.

After 70 epochs ("seq length to 16"), I haven't figured out a solution yet.
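Since items 1 and 3 both depend on which GPU the session was given, it can save time to check right after connecting. The sketch below assumes nvidia-smi is on the PATH, as it normally is on a Colab GPU runtime. The function names are hypothetical, and the advice strings only restate what is reported in this post.

```python
import re
import subprocess

# The four GPU models named in the post.
COLAB_GPUS = ("K80", "P4", "P100", "T4")


def classify_gpu(name):
    """Map a GPU name such as 'Tesla P100-PCIE-16GB' to one of COLAB_GPUS."""
    # Split on whitespace and hyphens so that only whole tokens are matched.
    tokens = re.split(r"[\s\-]+", name.upper())
    for model in COLAB_GPUS:
        if model in tokens:
            return model
    return None


def query_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # nvidia-smi missing or no GPU attached
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def advice(model):
    """Restate the post's experience for a given GPU model."""
    if model in ("K80", "P4"):
        return "flownet2 install reported to fail; reset the runtime or edit nvcc_args"
    if model == "T4":
        return "install works; out of memory reported after 50 epochs"
    if model == "P100":
        return "install works; recommended in the post"
    return "not one of the GPUs discussed in the post"


if __name__ == "__main__":
    gpu_name = query_gpu_name()
    if gpu_name is None:
        print("No GPU found (is a GPU runtime selected?)")
    else:
        print(gpu_name, "->", advice(classify_gpu(gpu_name)))
```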

Note: the problem in item 3 has since been fixed by the program's author, @tcwang0509, and training now runs smoothly. After 70 epochs the Colab host is rather slow, but that can't be helped: it's free! (2020.2.13)

The program is great. Thanks to the developers and to NVlabs (I had just gone crazy and shorted NV stock, thinking few-v2v and stylegan2 wouldn't be out by the end of the year ... oh, my God!). I wish you all the best!

AaronWong commented 4 years ago

Q2: adjust the option at train_options.py line 9, parser.add_argument('--display_freq', type=int, default=100, help='frequency of showing training results on screen'), or remove iter below util/visualizer.py line 110, if self.use_html:
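One reading of the first suggestion is that a larger --display_freq makes the web/images folder fill up more slowly. The sketch below reproduces the quoted add_argument line in a standalone parser and estimates the number of preview saves. Two things are assumptions rather than facts from the repo: that previews are written once every display_freq iterations, and the helper name previews_saved. The --no_html line only mirrors the flag mentioned in the first post.

```python
import argparse


def build_parser():
    """Standalone parser for illustration; not the repo's option class."""
    parser = argparse.ArgumentParser()
    # Line quoted in the comment above (train_options.py line 9).
    parser.add_argument('--display_freq', type=int, default=100,
                        help='frequency of showing training results on screen')
    # Flag mentioned in the first post; its definition here is assumed.
    parser.add_argument('--no_html', action='store_true',
                        help='do not save intermediate results to the web folder')
    return parser


def previews_saved(total_iters, display_freq):
    """Estimate preview saves, assuming one save every display_freq iterations."""
    return total_iters // display_freq


opt = build_parser().parse_args(['--display_freq', '1000'])
for freq in (100, opt.display_freq):
    print('display_freq=%d -> about %d preview saves per 10000 iterations'
          % (freq, previews_saved(10000, freq)))
```

Under that assumption, raising the value from 100 to 1000 cuts the number of saved previews by a factor of ten, which delays the point at which the Drive folder becomes too large to read.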

pythagoras000 commented 4 years ago

@AaronWong approximately how much training time does it take on Colab to achieve the same results as the GIFs in this repo?

AaronWong commented 4 years ago

Hi @ssaleth

> @AaronWong approximately how much training time does it take on Colab to achieve the same results as the GIFs in this repo?

I didn't use Colab, so you may want to ask jjandnn. I'm still training on my own server and haven't achieved a good result yet.