Question about training

Zhangjzh commented 1 year ago

After training the implicit MLP, I got quite weird results. The reconstructed meshes are poor. The evaluation results show that NC is very low, while Chamfer and P2S are very high. Do you know where the problem is? I would appreciate it a lot if you could give me some suggestions!

YuliangXiu commented 1 year ago

Please first visualize the input data to check that the scans, SMPL, images, etc. are well aligned; see training.md.


# visualization for SMPL-X mesh
python -m lib.dataloader_demo -v -c ./configs/train/icon-filter.yaml

# visualization for voxelized SMPL
python -m lib.dataloader_demo -v -c ./configs/train/pamir.yaml

Zhangjzh commented 1 year ago

Ok, I will have a try. Thank you very much!

Yuhuoo commented 1 year ago

I think there may be some problems in the preprocessing scripts. I used them to preprocess the THuman2.0 dataset, but the preprocessed data is not aligned with the SMPL-X mesh. I have to download SMPL+X.zip again to align the SMPL-X mesh with the preprocessed data. I checked, and the SMPL-X mesh has been changed by the preprocessing. Others have run into this problem too, and it causes training to fail. It is very strange! Do you have any suggestions? Thank you very much!

YuliangXiu commented 1 year ago

Hi @Zhangjzh and @Yuhuoo

I have corrected some bugs and updated the scripts for training data generation; see dataset.md.

Please re-download the SMPL-X.zip, and re-run the data generation scripts:

conda activate icon
python -m scripts.render_batch -headless -out_dir data/

Now data/thuman2/smplx/xxxx.obj and data/thuman2/scans/xxxx.obj are aligned perfectly.
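On a machine without a display, a rough numerical check of this alignment may be enough to catch gross errors. The following is a minimal sketch, not part of the ICON scripts: it assumes both files store vertices as plain "v x y z" lines, and it only compares bounding-box centers and sizes. The function names and the tolerance values are illustrative assumptions.

```python
# Coarse, display-free alignment check between two OBJ meshes (stdlib only).
# Assumption: vertices are stored as plain "v x y z" lines.
# The tolerances below are illustrative, not values from the ICON repo.
import math
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def read_obj_vertices(path: str) -> List[Vec3]:
    """Collect the xyz coordinates of every 'v' line in an OBJ file."""
    vertices: List[Vec3] = []
    with open(path, "r") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 4 and parts[0] == "v":
                vertices.append((float(parts[1]), float(parts[2]), float(parts[3])))
    return vertices


def bbox_center_and_diagonal(vertices: Sequence[Vec3]) -> Tuple[Vec3, float]:
    """Return the bounding-box center and the length of its diagonal."""
    if not vertices:
        raise ValueError("mesh has no vertices")
    xs, ys, zs = zip(*vertices)
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    center = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    diagonal = math.sqrt(sum((b - a) ** 2 for a, b in zip(lo, hi)))
    return center, diagonal


def alignment_report(verts_a: Sequence[Vec3], verts_b: Sequence[Vec3]) -> Dict[str, float]:
    """Compare the bounding boxes of two vertex sets."""
    center_a, diag_a = bbox_center_and_diagonal(verts_a)
    center_b, diag_b = bbox_center_and_diagonal(verts_b)
    offset = math.sqrt(sum((a - b) ** 2 for a, b in zip(center_a, center_b)))
    return {
        "center_offset": offset,
        "relative_offset": offset / diag_a if diag_a > 0 else float("inf"),
        "size_ratio": diag_b / diag_a if diag_a > 0 else float("inf"),
    }


def roughly_aligned(report: Dict[str, float],
                    max_relative_offset: float = 0.05,
                    max_size_mismatch: float = 0.2) -> bool:
    """True if centers nearly coincide and overall sizes are similar."""
    return (report["relative_offset"] <= max_relative_offset
            and abs(report["size_ratio"] - 1.0) <= max_size_mismatch)
```

Calling alignment_report(read_obj_vertices("data/thuman2/smplx/xxxx.obj"), read_obj_vertices("data/thuman2/scans/xxxx.obj")) and printing the result gives a quick signal. A large center offset or a size ratio far from 1 points to a translation or scale mismatch. A passing result does not prove a good fit, since a clothed scan and a body mesh can share a bounding box and still differ in pose.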

Yuhuoo commented 1 year ago

Thank you very much for your reply and the fixes. However, I ran into the following problem when using the scripts. I have tried many ways to solve it, without success. Have you encountered this problem?

The complete error message is as follows:

Start Rendering thuman2 with 36 views, 512x512 size.
Output dir: ./debug/thuman2_36views
Rendering types: ['light', 'normal', 'depth']
  0%|                                                                                                     | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/icon/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/miniconda3/envs/icon/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/hy-tmp/ICON/scripts/render_batch.py", line 224, in <module>
    for _ in tqdm(
  File "/usr/local/miniconda3/envs/icon/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/usr/local/miniconda3/envs/icon/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f89f2e9f5b0>'. Reason: 'ValueError('ctypes objects containing pointers cannot be pickled')'
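
This MaybeEncodingError means that a worker process raised an exception and the pool then failed to pickle that exception when sending it back to the parent, so the real error is hidden. One way to see it is to catch exceptions inside the worker and return the traceback as plain text. The sketch below illustrates the idea; render_one is a hypothetical stand-in for the per-subject worker function, and the simulated failure is an assumption, not the actual cause in render_batch.py.

```python
# Sketch: surface the real worker error instead of MaybeEncodingError.
# `render_one` is a hypothetical stand-in for the per-subject worker function.
import ctypes
import functools
import pickle
import traceback


def picklable(obj) -> bool:
    """Return True if `obj` can be sent between processes via pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def report_errors(func):
    """Wrap a pool worker so failures come back as plain-text tracebacks."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return ("ok", func(*args, **kwargs))
        except Exception:
            # A string can always be pickled, unlike an exception object
            # that holds ctypes pointers.
            return ("error", traceback.format_exc())
    return wrapper


@report_errors
def render_one(subject: str):
    # Simulated failure: the exception carries a ctypes pointer, which is
    # what makes the original exception impossible to pickle.
    raise RuntimeError("render failed", ctypes.pointer(ctypes.c_int(0)))
```

With a wrapper like this, the loop over pool results can print the "error" entries and show which call actually failed inside the worker.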

YuliangXiu commented 1 year ago

Very strange, this works well for me.

I updated it just now to resolve "numba" warnings, but I don't think it will solve your problem.

Zhangjzh commented 1 year ago

Hi @Yuhuoo, I ran into the same problem as you did. Did you solve it?

yxt7979 commented 1 year ago

You can try this: change

for gpu_ids in range(NUM_GPUS):
    for _ in range(PROC_PER_GPU):
        queue.put(gpu_ids)

to

queue.put(0)

YuliangXiu commented 1 year ago

My workstation has two GPUs. If you are running on a single-GPU machine, you can remove all of these queue lines.
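
The queue discussed here appears to hand out one GPU id per worker slot. A minimal sketch of that pattern follows, assuming this reading of the script is right. It uses the standard-library queue.Queue purely for illustration (the real script presumably uses a multiprocessing queue), and the helper names are hypothetical.

```python
# Sketch of handing GPU ids to workers through a queue.
# Loop structure mirrors the snippet quoted in this thread; the helper
# functions are hypothetical and use queue.Queue only for illustration.
import queue
from typing import List


def build_gpu_slots(num_gpus: int, proc_per_gpu: int) -> "queue.Queue[int]":
    """Put one GPU id on the queue for every worker slot."""
    slots: "queue.Queue[int]" = queue.Queue()
    for gpu_id in range(num_gpus):
        for _ in range(proc_per_gpu):
            slots.put(gpu_id)
    return slots


def drain(slots: "queue.Queue[int]") -> List[int]:
    """Read all ids back in order, to inspect what workers would receive."""
    ids: List[int] = []
    while not slots.empty():
        ids.append(slots.get())
    return ids
```

With num_gpus set to 1, every slot holds id 0, which has the same effect as the queue.put(0) suggestion above: all workers render on the first GPU.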

yxt7979 commented 1 year ago

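Hello, I ran into the same problem as the original poster (NC=0.2, P2S=4.). After re-downloading SMPL-X.zip and re-running render_batch, I checked that the two .obj files are aligned, but training with the latest ICON code from a fresh git clone of the repo still has problems, and the metrics are even further off than before the change (P2S 40). During training, the model's performance on CAPE keeps getting worse. Although the metrics look fairly normal, when I replaced CAPE with 5 samples from the THuman2 validation set at test time, the results were also poor. Since I am working on a server, I cannot use python -m lib.dataloader_demo -v -c ./configs/train/icon-filter.yaml to check whether the training data is correct. Is there another way to check whether the training data has problems? Why does the same model give such different results on CAPE and THuman2? What should I change to reproduce the training accuracy reported in the paper? I look forward to your advice, thank you.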

Zhangjzh commented 1 year ago

After re-downloading SMPL-X.zip, I got good results. I didn't re-run the data generation scripts, because I ran into some trouble that I couldn't figure out how to solve. But when I ran python -m lib.dataloader_demo -v -c ./configs/train/icon-filter.yaml, the results were well aligned. I hope this helps.

yxt7979 commented 1 year ago

Hi! Thank you so much! Are you using the latest code in the repo? I didn't change the code and still get wrong results. Could you please share your training loss? Does it look like this in the first epoch?

cucdengjunli commented 1 year ago

SAME QUESTION

yxt7979 commented 1 year ago

After changing the trimesh version to 3.17.1, the .obj files under /thuman2/smplx/ are correct.
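
Before regenerating data, it may help to confirm which trimesh version the active environment actually has. The snippet below is a small sketch using the standard-library importlib.metadata (available in Python 3.8, which the tracebacks in this thread show). The helper names are illustrative, and 3.17.1 is simply the version reported above.

```python
# Sketch: check an installed package version against an expected one.
# Helper names are illustrative; "3.17.1" is the version reported in this thread.
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_expected(package: str, expected: str) -> bool:
    """True only if the package is installed at exactly the expected version."""
    return installed_version(package) == expected
```

For example, matches_expected("trimesh", "3.17.1") returns False when a different version is installed, in which case pip install trimesh==3.17.1 inside the icon environment pins it.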

gushengbo commented 1 year ago

Hello @YuliangXiu @yxt7979, I downloaded SMPL-X again and changed the trimesh version to 3.17.1, but I still have the same problem: data/thuman2/smplx/xxxx.obj and data/thuman2/scans/xxxx.obj are not aligned.

gushengbo commented 1 year ago

I found that the alignment is wrong when I run the previous code, but right when I run the latest code. However, with the latest code I hit this problem:

(icon) shengbo@user-SYS:~/ICON-master$ python -m scripts.render_batch -debug -headless
Start Rendering thuman2 with 36 views, 512x512 size.
Output dir: ./debug/thuman2_36views
Rendering types: ['light', 'normal', 'depth']
  0%|          | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/shengbo/anaconda3/envs/icon/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/shengbo/anaconda3/envs/icon/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/shengbo/ICON-master/scripts/render_batch.py", line 254, in <module>
    for _ in tqdm(
  File "/home/shengbo/anaconda3/envs/icon/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/shengbo/anaconda3/envs/icon/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f216d3fe7f0>'. Reason: 'ValueError('ctypes objects containing pointers cannot be pickled')'

gushengbo commented 1 year ago

@Zhangjzh @Yuhuoo hello, have you solved this problem? Thank you!

AndrewMorgan2 commented 1 year ago

@gushengbo did you find any solution?

dongdozizi commented 1 year ago

@Zhangjzh @Yuhuoo @AndrewMorgan2 Hello, how did you download SMPL-X.zip? The link now returns 404 Not Found.

AndrewMorgan2 commented 1 year ago

Please make a new issue if you have a new question

glorioushonor commented 1 year ago

Hi, I'd like to ask you some questions about this. You just recommended re-running the data generation script:

conda activate icon
python -m scripts.render_batch -headless -out_dir data/

Now data/thuman2/smplx/xxxx.obj and data/thuman2/scans/xxxx.obj are aligned perfectly. But I wonder whether I also need to re-run the following data generation script:

python -m scripts.visibility_batch -out_dir data/

I ask because I tried debugging a model and found that data/thuman2_36views/scans/xxxx/vis/xxxx.obj had changed. Although the difference did not significantly affect the vedo visualization, will skipping the second script have any effect on training?

YuliangXiu commented 1 year ago

Yes, you need to re-run the visibility computation, since the visibility is computed on the SMPL-X objs.
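
Since the visibility step depends on the regenerated SMPL-X objs, the two commands quoted in this thread need to run in order: rendering first, visibility second. A minimal sketch of a driver that enforces this order is below. It assumes it is launched from the ICON repository root with the icon environment active; the function names are illustrative and not part of the repo.

```python
# Sketch: run the two data generation steps from this thread in order.
# Assumes the current directory is the ICON repo root and the `icon`
# environment is active. Function names are illustrative.
import subprocess
import sys
from typing import List


def regeneration_commands(out_dir: str = "data/") -> List[List[str]]:
    """Build both commands; visibility comes after rendering."""
    return [
        [sys.executable, "-m", "scripts.render_batch", "-headless", "-out_dir", out_dir],
        [sys.executable, "-m", "scripts.visibility_batch", "-out_dir", out_dir],
    ]


def regenerate(out_dir: str = "data/") -> None:
    """Run each step and stop at the first failure (check=True raises)."""
    for cmd in regeneration_commands(out_dir):
        subprocess.run(cmd, check=True)
```

Stopping at the first failure matters here: if rendering fails, running the visibility step anyway would compute visibility on stale objs.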

YuliangXiu / ICON

Question about training #164