Open YGZone opened 11 months ago
When I train with SDK 0806, the error does not appear.
@YGZone Did you solve this issue? I have the same problem.
No. I suspected mmcv was causing this error, so I compiled mmcv (the TI-specified version, 1.4.8) against my native CUDA, but that did not solve the problem.
Have you posted this question in the forum?
The issue is due to an incompatibility between the mmdetection / mmcv and PyTorch versions. We hope to update to a recent mmdetection version in January 2024.
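Since the cause named here is a version mismatch, it may help to post the installed versions along with the report. The following is only a minimal sketch that uses the standard library to list them; the package names in PACKAGES are assumptions (mmcv may be installed as either "mmcv" or "mmcv-full"), and it does not decide which combinations are compatible.

```python
from importlib import metadata

# Assumed distribution names; adjust to match your environment.
PACKAGES = ("torch", "torchvision", "mmcv", "mmcv-full", "mmdet")


def installed_versions(names=PACKAGES):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the same virtual environment that is used for training shows which versions are actually picked up there.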
@mathmanu Has the issue been resolved?
Have you solved it?
Same issue here, 100% reproducible. When the GPU is enabled (the number of enabled GPUs is irrelevant), a DataContainer object is added as a wrapper around the tensor (at least for the validation set in my testing). So unless some config setting is missing, there is a bigger issue here. Could someone check, please? So far it is not possible to train the model on a GPU because of this.
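To make the wrapper behaviour described above easier to inspect, here is a rough sketch of a helper that strips wrapper objects from a nested batch. StandInContainer and unwrap are hypothetical names used only for illustration; with mmcv 1.x one would presumably pass mmcv.parallel.DataContainer as the wrapper type instead, which is an assumption and is not tested here. This is a debugging aid, not a fix for the underlying incompatibility.

```python
class StandInContainer:
    """Hypothetical stand-in for a wrapper that keeps its payload in `.data`."""

    def __init__(self, data):
        self.data = data


def unwrap(obj, wrapper_types):
    """Recursively replace instances of wrapper_types with their `.data` payload.

    The wrapper type is passed explicitly rather than detected via hasattr,
    because tensors also expose a `.data` attribute.
    """
    if isinstance(obj, wrapper_types):
        return unwrap(obj.data, wrapper_types)
    if isinstance(obj, dict):
        return {key: unwrap(value, wrapper_types) for key, value in obj.items()}
    if isinstance(obj, list):
        return [unwrap(value, wrapper_types) for value in obj]
    if isinstance(obj, tuple):
        return tuple(unwrap(value, wrapper_types) for value in obj)
    return obj


if __name__ == "__main__":
    batch = {"img": StandInContainer([[1, 2], [3, 4]])}
    print(unwrap(batch, (StandInContainer,)))
```

Printing the type of each value in a validation batch before and after unwrapping should show where the wrapper is still present when the model receives the data.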
To this day (20230301), I still use the CPU for training. My model is small, so training on the CPU is fast enough.
You can comment out the CUDA 11.8 and mmcv installation commands in setup.py, then try again with your native, latest CUDA.
When I train yolo_nano_lite with CUDA, I get this error:
AttributeError: DataContainer has no attribute size for type <class 'list'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12206) of binary: /home/zyg/.pyenv/versions/py310/bin/python3
Here is my config file:
common:
    target_module: 'vision'
    task_type: 'detection'
    target_device: 'TDA4VM'
    # run_name can be any string, but there are some special cases:
dataset:
    # enable/disable dataset loading
training:
    # enable/disable training
compilation:
    # enable/disable compilation
I don't understand what causes the error. It may be an environment problem, but my CUDA installation is correct and usable. When I train on the CPU, the error disappears.