Which dataset are you trying to train?
I do not have a good machine.
I just want to run a very basic CIFAR experiment to figure out this codebase.
Also, can I run it across a few Nanos, say 20, in parallel?
Thanks!!
4 GB should be enough to run CIFAR. If not, you could decrease init_channels or layers in https://github.com/D-X-Y/GDAS/blob/master/scripts-cnn/train-cifar.sh#L36. The code supports multi-GPU parallelism.
Thanks. I tried
--init_channels 1 --layers 2 \
but it does not work:
.....
Train model from scratch without pre-trained model or snapshot
==>>[2019-08-12-16:41:37] [Epoch=000/600] [Need: 00:00:00] LR=0.0250 ~ 0.0250, Batch=96
THCudaCheck FAIL file=/media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THCUNN/generic/SpatialAveragePooling.cu line=184 error=7 : too many resources requested for launch
Traceback (most recent call last):
File "./exps-cnn/train_base.py", line 89, in
I also tried --init_channels 1 --layers 1 \
It does not work either:
Train model from scratch without pre-trained model or snapshot
==>>[2019-08-12-16:43:46] [Epoch=000/600] [Need: 00:00:00] LR=0.0250 ~ 0.0250, Batch=96
Traceback (most recent call last):
File "./exps-cnn/train_base.py", line 89, in
Please try with init_channels >= 2 and layers >= 2
With this setting (init_channels = 2 and layers = 2), it complains the following:
==>>[2019-08-12-18:51:54] [Epoch=000/600] [Need: 00:00:00] LR=0.0250 ~ 0.0250, Batch=96
THCudaCheck FAIL file=/media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THCUNN/generic/SpatialAveragePooling.cu line=184 error=7 : too many resources requested for launch
Traceback (most recent call last):
File "./exps-cnn/train_base.py", line 89, in
It seems to be a hardware problem. I can run it successfully on my GPU.
So were you able to try the 4 GB GPU memory situation with your GPU?
Which file(s) do I need to look into to debug the resources issue?
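One generic PyTorch/CUDA debugging trick (not specific to this repo) is to force synchronous kernel launches so the Python traceback stops at the exact call that fails instead of a later, unrelated line; a minimal sketch:

```python
# Generic debugging aid: make CUDA kernel launches synchronous so errors are
# reported at the failing call site. Must be set before CUDA is initialized;
# the shell equivalent is `CUDA_LAUNCH_BLOCKING=1 python ./exps-cnn/train_base.py ...`.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after the environment variable is set
```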
I think the Jetson may be able to run it in parallel... so how do I make it run in a parallel (cluster) fashion?
Thanks!
I did not have a 4 GB GPU memory setup, but I checked the GPU memory usage, and it is lower than 4 GB. The code uses GPU parallelism by default. Regarding the resources issue, I'm not familiar with how to debug that.
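For context, PyTorch's standard single-machine multi-GPU wrapper looks roughly like the sketch below (a generic example, not necessarily the repo's exact wrapper). Note that it splits each batch across the GPUs inside one machine; spreading training across many separate Jetson boards would instead need something like torch.nn.parallel.DistributedDataParallel over the network.

```python
# Generic single-machine data parallelism in PyTorch (illustrative model,
# not GDAS-specific). nn.DataParallel replicates the model on every visible
# GPU and splits each input batch across them.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, 3, padding=1),
)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

images = torch.randn(96, 3, 32, 32, device="cuda")  # CIFAR-sized batch
outputs = model(images)
print(outputs.shape)  # (96, 10, 32, 32)
```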
Thanks. I can look into it. There might be some differences in hardware architecture between the Jetson Nano and a typical GPU, but it is certainly interesting to compare...
No worries. Sorry, I'm not familiar with the Jetson Nano. I will close this issue for now; please feel free to reopen it if you want.
I am trying to train it on a Jetson Nano with 4 GB of memory.
Is this possible?
Can I reduce the resources requested?
Thanks!