Pitfalls and solutions for training models on Colab

jjandnn commented 4 years ago

1. CUDA errors ("no kernel image" and "no memory"). Colab mainly assigns four GPU types: K80, P4, T4, and P100. On the P100 and T4, installing and running flownet2_pytorch (which has to be reinstalled in every session) works without problems: follow the official README, or go into few-shot-vid2vid/models/networks/flownet2_pytorch/ and run !bash install.sh. On the K80 and P4 it fails with the error "cuda kernel: no kernel image ..." (a very old, never-resolved CUDA problem).

The best fix is simply to reset all runtimes so that Colab assigns a different host, and to repeat this until you get a T4 or P100. It takes the least effort and is the most efficient.

The second option is to edit these three files:

/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/channelnorm_package/setup.py
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/correlation_package/setup.py
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/resample2d_package/setup.py

In each of them, add the gencode entries that match your GPU to nvcc_args:

    nvcc_args = [
        '-gencode', 'arch=compute_30,code=sm_30',
        '-gencode', 'arch=compute_35,code=sm_35',
        '-gencode', 'arch=compute_37,code=sm_37',
        '-gencode', 'arch=compute_50,code=sm_50',
        '-gencode', 'arch=compute_52,code=sm_52',
        '-gencode', 'arch=compute_60,code=sm_60',
        '-gencode', 'arch=compute_61,code=sm_61',
        '-gencode', 'arch=compute_70,code=sm_70',
        '-gencode', 'arch=compute_70,code=compute_70'
    ]

On the K80, also pin pytorch==0.41 (presumably version 0.4.1).
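The nvcc_args list above follows a regular pattern, so it can be generated rather than typed by hand. The following is only a minimal sketch: the helper name gencode_flags is hypothetical and not part of flownet2_pytorch, and the architecture list is simply the one quoted in this post (37 corresponds to the K80, whose compute capability is 3.7).

```python
def gencode_flags(sm_archs, ptx_arch=None):
    """Build nvcc -gencode flags for a list of compute capabilities.

    sm_archs: strings such as '37' or '60' (compute capability without the dot).
    ptx_arch: optional architecture for which PTX is also embedded, so that
              newer GPUs can JIT-compile the kernels at load time.
    """
    flags = []
    for arch in sm_archs:
        # One native binary (SASS) per listed architecture.
        flags += ['-gencode', 'arch=compute_{0},code=sm_{0}'.format(arch)]
    if ptx_arch is not None:
        # PTX fallback for architectures newer than the ones listed.
        flags += ['-gencode', 'arch=compute_{0},code=compute_{0}'.format(ptx_arch)]
    return flags


# Reproduces the list from the post (hypothetical helper, same flags).
nvcc_args = gencode_flags(['30', '35', '37', '50', '52', '60', '61', '70'],
                          ptx_arch='70')
print(nvcc_args)
```

The result can be assigned to nvcc_args in each of the three setup.py files. Whether a given CUDA toolkit still accepts every architecture in the list depends on its version, so entries may need to be removed if nvcc rejects them.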

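2. Error reading the web preview images when resuming training: "input/output: epoch ...". This error does not occur locally, only on Colab with Google Drive. Cause: when a Google Drive folder contains too many files, Colab cannot read it (also a long-standing problem). Fix: go to /content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/, delete the entire images folder, and create a new, empty images folder. Alternatively, pass the '--no_html' option when training (I have not tested this, because I need the preview).

Note: when resuming training, iters may not be a whole number; this does not affect the result.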
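For item 2, the delete-and-recreate step can be run from a Colab cell. This is a rough sketch, not something taken from the repository: the function name reset_images_dir is hypothetical, and the path is the one quoted above, which assumes the "face" experiment name and Drive mounted at /content/drive.

```python
import os
import shutil


def reset_images_dir(web_dir):
    """Delete web_dir/images (if present) and recreate it as an empty folder."""
    images_dir = os.path.join(web_dir, 'images')
    if os.path.isdir(images_dir):
        # Removes the folder together with all preview images inside it.
        shutil.rmtree(images_dir)
    os.makedirs(images_dir)
    return images_dir


# Path from the post; adjust it if your checkpoint name is not "face".
WEB_DIR = '/content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web'
if os.path.isdir(WEB_DIR):
    reset_images_dir(WEB_DIR)
```

If the folder is so large that Colab cannot list it at all, the deletion may fail in the same way; in that case deleting the folder from the Google Drive web interface is an alternative.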

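3. After 50 epochs ("seq length to XX"), training runs out of memory. Fix: only switching to a P100 helps; a T4 is not enough. After 70 epochs ("seq length to 16") I have not found a fix yet. Sigh, the upgraded flownet2 validation is just too large.

Note: the problem in item 3 has since been fixed by the author, @tcwang0509, in an update, and training now runs smoothly. After 70 epochs the Colab host is rather slow, but that can't be helped: it's free! (2020.2.13)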
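Since items 1 and 3 both depend on which GPU Colab assigned, it can help to check the GPU before installing anything. The sketch below queries nvidia-smi; the function names are hypothetical, and the name matching assumes the GPU is reported with names such as "Tesla T4" or "Tesla P100-PCIE-16GB".

```python
import shutil
import subprocess


def query_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which('nvidia-smi') is None:
        return None
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name', '--format=csv,noheader'],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def gpu_is_suitable(name, wanted=('P100', 'T4')):
    """True if the GPU name contains one of the wanted model tags."""
    if name is None:
        return False
    return any(tag in name for tag in wanted)


name = query_gpu_name()
print(name, gpu_is_suitable(name))
```

For the out-of-memory case in item 3, calling gpu_is_suitable(name, wanted=('P100',)) restricts the check to the P100. If the check fails, reset the runtime and try again.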

The program is great. Thanks to the developers and to NVlabs (I had just gone crazy and shorted NV stock, thinking few-v2v and stylegan2 would not be out before the end of the year ... oh my God!). I wish you all smooth and happy training!

AaronWong commented 4 years ago

Q2: see train_options.py line 9, parser.add_argument('--display_freq', type=int, default=100, help='frequency of showing training results on screen'), or remove iter below util/visualizer.py line 110, if self.use_html:
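As a rough illustration of the first option: because --display_freq is an ordinary argparse option, its value can presumably be overridden on the command line instead of editing the default in train_options.py. The sketch below only reproduces the quoted add_argument call in isolation; the value 1000 is a hypothetical example, not a recommendation from this thread.

```python
import argparse

# Stand-alone copy of the option quoted from train_options.py line 9.
parser = argparse.ArgumentParser()
parser.add_argument('--display_freq', type=int, default=100,
                    help='frequency of showing training results on screen')

# Equivalent to passing "--display_freq 1000" on the training command line.
opt = parser.parse_args(['--display_freq', '1000'])
print(opt.display_freq)
```

A larger value means results are displayed less often, which should also mean fewer preview images accumulate in the Drive folder.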

pythagoras000 commented 4 years ago

@AaronWong how much training time approximately can it take to achieve the same results as on the gifs of this repo using colab?

AaronWong commented 4 years ago

Hi @ssaleth

@AaronWong how much training time approximately can it take to achieve the same results as on the gifs of this repo using colab?

I didn't use Colab, so you may want to ask jjandnn. I'm still training on my own server and haven't achieved a good result yet.

NVlabs / few-shot-vid2vid

Pitfalls and solutions for training models on Colab #8