JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0
3.42k stars 837 forks

Unable to train on a large dataset #632

Closed widgetxp closed 2 years ago

widgetxp commented 2 years ago

If you do not know the root cause of the problem, and wish someone to help you, please post according to this template:

Instructions To Reproduce the Issue:

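When training FastRetri on a training set of 18 million images, the screen stops producing output for about 5 minutes after the label file has been read, and then an error is raised with the following message: (image)

I tried a few things to work around the problem:

  1. Common advice online is to reduce the number of dataloader workers. Training only starts after lowering it from 8 to 4, but GPU utilization is then too low.

  2. Using only half of the training data (11 million images), with 8 workers still in place, training also starts normally.

  3. Reducing the data-loading queue size from the default of 10 to 2 does not work.

  4. Reducing batch_size from 2048 to 1024 does not work.

What is the root cause of this problem, and how can it be fixed? A screenshot of the config follows: (image)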
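The thread does not identify a root cause, and the error itself is only available as a screenshot. One hypothesis that would fit attempts 1 and 2 (fewer workers or fewer images help, while queue size and batch size do not) is that the in-memory sample list ends up duplicated in every dataloader worker process. This is only an assumption, not something confirmed here. The sketch below gives a rough way to estimate that footprint; the path layout, the label values, and both function names are hypothetical, and a 64-bit CPython is assumed.

```python
import sys


def estimate_list_bytes(num_items, sample_size=10_000):
    """Extrapolate the size of a list of (path, label) tuples from a small sample."""
    # Hypothetical path layout and labels, used only to get realistic object sizes.
    sample = [(f"/data/goods/{i:09d}.jpg", 1000 + i) for i in range(sample_size)]
    object_bytes = sum(
        sys.getsizeof(item) + sys.getsizeof(item[0]) + sys.getsizeof(item[1])
        for item in sample
    )
    # Add one 8-byte list slot per item (assumes 64-bit pointers).
    per_item = (object_bytes + 8 * sample_size) / sample_size
    return int(per_item * num_items)


def worst_case_total_bytes(num_items, num_gpus, workers_per_gpu):
    """Upper bound if every trainer process and every worker holds a private copy."""
    processes = num_gpus * (1 + workers_per_gpu)
    return estimate_list_bytes(num_items) * processes
```

Comparing `worst_case_total_bytes(18_000_000, 4, 8)` with `worst_case_total_bytes(18_000_000, 4, 4)` and `worst_case_total_bytes(11_000_000, 4, 8)` against the host RAM shows whether this explanation is plausible for a given machine. It is an upper bound, since forked workers may share pages until they are written to.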

  1. full code you wrote or full changes you made (git diff)

Newly added dataset: (image)

  2. what exact command you run: python projects/FastRetri/train_net.py --config-file projects/FastRetri/configs/goods-image_retri.yml --num-gpus 4

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

L1aoXingyu commented 2 years ago

I suggest generating a txt file offline ahead of time and then reading the index from that txt during training; that should solve the problem.
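A minimal sketch of one way to read this suggestion, not fast-reid's own API: write one line per image to a txt file offline, then at training time keep only a compact array of byte offsets in memory and read each line on demand. The names `write_index`, `build_offsets`, and `TxtIndex`, and the tab-separated `path<TAB>label` line format, are assumptions made for illustration.

```python
from array import array


def write_index(samples, txt_path):
    """Offline step: write one 'path<TAB>label' line per image."""
    with open(txt_path, "w", encoding="utf-8", newline="\n") as f:
        for path, label in samples:
            f.write(f"{path}\t{label}\n")


def build_offsets(txt_path):
    """Scan the txt once and record the byte offset at which each line starts."""
    offsets = array("Q")  # 8 bytes per entry, no per-item Python objects
    pos = 0
    with open(txt_path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


class TxtIndex:
    """Look up (path, label) by position without holding all samples in memory."""

    def __init__(self, txt_path):
        self.txt_path = txt_path
        self.offsets = build_offsets(txt_path)
        self._file = None  # opened on first access, after any worker has started

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        if self._file is None:
            self._file = open(self.txt_path, "rb")
        self._file.seek(self.offsets[i])
        line = self._file.readline().decode("utf-8").rstrip("\n")
        # Split on the last tab so that only the label field is separated off.
        path, label = line.rsplit("\t", 1)
        return path, int(label)
```

A dataset class could wrap `TxtIndex` and load the image for `index[i]` inside its own `__getitem__`. The offset array stores plain integers rather than Python tuples and strings, so it stays small even for tens of millions of lines.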

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.