ModelTC / United-Perception

United Perception
Apache License 2.0

OOM keeps occurring on a single machine with a single GPU #47

Open Lllllolita opened 1 year ago

Lllllolita commented 1 year ago

While reproducing EFL with the UP framework I ran into the following problem: the server's GPU has plenty of free memory, but training keeps reporting OOM. I have confirmed there are no zombie processes holding GPU memory on the server, and the OOM persists even with batch size set to 1. My server configuration is:

python: 3.7
cuda: 11.3
torch: 1.10.0
gpu: RTX 3090
config: configs/det/efl/efl_yolox_medium.yaml

What could be causing this?

yqyao commented 1 year ago

Maybe we need more error logs to reproduce it @Lllllolita

Lllllolita commented 1 year ago

This is the log produced by running `python -m up train --config configs/det/efl/efl_yolox_medium_test.yaml --nm 1 --ng 1 --launch pytorch 2>&1 | tee log.train`: train.log

These two files are the logs produced by running `./easy_setup.sh`: compile.log compile_err.log

yqyao commented 1 year ago

The batch_size in your log is 8. Maybe you need to recompile after exporting `TORCH_CUDA_ARCH_LIST='3.5;5.0+PTX;6.0;7.0;8.0;8.6'` in easy_setup.sh @Lllllolita
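A minimal sketch of the recompile step suggested above, assuming `easy_setup.sh` is the script that builds the repo's custom CUDA ops. The `8.6` entry covers the RTX 3090 (compute capability sm_86); if the extensions were compiled without it, CUDA errors at runtime can surface as misleading OOM messages.

```shell
# Include sm_86 (RTX 3090) in the architectures the CUDA ops are built for.
export TORCH_CUDA_ARCH_LIST='3.5;5.0+PTX;6.0;7.0;8.0;8.6'
echo "$TORCH_CUDA_ARCH_LIST"

# Then rerun the build script on the server, e.g.:
# ./easy_setup.sh 2>&1 | tee compile.log
```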

Lllllolita commented 1 year ago

Thank you very much for the suggestion. Single-machine single-GPU training now runs successfully with batch size 4, but single-machine multi-GPU still fails, and `torch.cuda.is_available()` returns False. The command is: `python -m up train --config configs/det/efl/efl_yolox_medium_test.yaml --nm 1 --ng 2 --launch pytorch 2>&1 | tee log.train` train(1).log

yqyao commented 1 year ago

Maybe you need to check your CUDA env? @Lllllolita
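A quick sanity check along the lines suggested here, using only standard tools (nothing United-Perception-specific). For `--ng 2` to work, `CUDA_VISIBLE_DEVICES` must be unset or list both cards, the driver must see both GPUs, and the installed torch build must be CUDA-enabled.

```shell
# Is CUDA_VISIBLE_DEVICES hiding the GPUs from this shell?
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# Driver-level check (run on the server):
# nvidia-smi

# Torch-level check -- should print "True 2" for two visible GPUs:
# python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```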

happygds commented 1 year ago

@Lllllolita Hi, have you solved the problem of `torch.cuda.is_available()` returning False when the number of GPUs > 1? I'm running into the same problem now; how did you solve it?

happygds commented 1 year ago

@yqyao Why are the version requirements so hard to satisfy?