TexasInstruments / edgeai-modelmaker

This repository has been moved. The new location is in https://github.com/TexasInstruments/edgeai-tensorlab
https://github.com/TexasInstruments/edgeai

Train with cuda error. #11

Open YGZone opened 9 months ago

YGZone commented 9 months ago

When I train yolox_nano_lite with CUDA, I get the following error:

AttributeError: DataContainer has no attribute size for type <class 'list'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12206) of binary: /home/zyg/.pyenv/versions/py310/bin/python3

Here is my config file:

common:
    target_module: 'vision'
    task_type: 'detection'
    target_device: 'TDA4VM'
    # run_name can be any string, but there are some special cases:
    # {date-time} will be replaced with datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    # {model_name} will be replaced with the name of the model
    run_name: '{date-time}/{model_name}'

dataset:
    # enable/disable dataset loading
    enable: True #False
    # max_num_files: [750, 250] #None
    # Object Detection Dataset Examples:
    # -------------------------------------
    # Example 1, (known datasets): 'widerface_detection', 'pascal_voc0712', 'coco_detection', 'udacity_selfdriving', 'tomato_detection', 'tiscapes2017_driving'
    # dataset_name: widerface_detection
    # -------------------------------------
    # Example 2, give a dataset name and input_data_path.
    # input_data_path could be a path to zip file, tar file, folder OR http, https link to zip or tar files
    # for input_data_path these are provided with this repository as examples:
    #    'http://software-dl.ti.com/jacinto7/esd/modelzoo/08_06_00_01/datasets/tiscapes2017_driving.zip'
    #    'http://software-dl.ti.com/jacinto7/esd/modelzoo/08_06_00_01/datasets/animal_detection.zip'
    # -------------------------------------
    # Example 3, give image folders with annotation files (require list with values for both train and val splits)
    # dataset_name: coco_detection
    # input_data_path: ["./data/projects/coco_detection/dataset/train2017",
    #                        "./data/projects/coco_detection/dataset/val2017"]
    # input_annotation_path: ["./data/projects/coco_detection/dataset/annotations/instances_train2017.json",
    #                        "./data/projects/coco_detection/dataset/annotations/instances_val2017.json"]
    # -------------------------------------
    # dataset_name: tiscapes2017_driving
    # input_data_path: 'http://software-dl.ti.com/jacinto7/esd/modelzoo/08_06_00_01/datasets/tiscapes2017_driving.zip'
    dataset_name: animal_detection
    input_data_path: 'http://software-dl.ti.com/jacinto7/esd/modelzoo/08_06_00_01/datasets/animal_detection.zip'

training:
    # enable/disable training
    enable: True #False
    # Object Detection model chosen can be changed here if needed
    # options are: 'yolox_s_lite', 'yolox_tiny_lite', 'yolox_nano_lite', 'yolox_pico_lite', 'yolox_femto_lite'
    model_name: 'yolox_nano_lite'
    training_epochs: 15 #30
    batch_size: 8 #32
    learning_rate: 0.005
    num_gpus: 1 #1 #4

compilation:
    # enable/disable compilation
    enable: True #False
    tensor_bits: 8 #16 #32
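
For reference, the run_name comments in the config describe two placeholder substitutions. A minimal sketch of that expansion, assuming plain string replacement, could look like the following. The function name expand_run_name and its now parameter are hypothetical and are not taken from the modelmaker code.

```python
import datetime


def expand_run_name(template, model_name, now=None):
    """Expand the {date-time} and {model_name} placeholders in a run_name.

    Hypothetical helper: it mirrors the comments in the config file,
    not the actual modelmaker implementation.
    """
    now = now or datetime.datetime.now()
    # Same format string as quoted in the config comment.
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    return (template
            .replace("{date-time}", timestamp)
            .replace("{model_name}", model_name))


# A fixed timestamp makes the result reproducible.
fixed = datetime.datetime(2024, 1, 15, 9, 30, 0)
print(expand_run_name("{date-time}/{model_name}", "yolox_nano_lite", now=fixed))
# prints: 20240115-093000/yolox_nano_lite
```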

I don't understand what causes the error. It may be an environment problem, but my CUDA installation is correct and usable. When I train on the CPU, the error disappears.

YGZone commented 9 months ago

When I train with SDK 0806, the error does not appear.

Mugutech62 commented 9 months ago

@YGZone Did you solve this issue? I have the same issue.

YGZone commented 9 months ago

No. I think mmcv causes this error, so I compiled mmcv (the TI-specified version, 1.4.8) against my native CUDA, but that did not solve the problem.

Mugutech62 commented 9 months ago

Have you posted this question in the forum?

mathmanu commented 9 months ago

The issue is due to an incompatibility between the mmdetection / mmcv and PyTorch versions. We shall update to a recent mmdetection version, hopefully in January 2024.
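
When comparing environments for this kind of version mismatch, it can help to list what is actually installed. The sketch below reads the versions from package metadata without importing anything. The distribution names (torch, torchvision, mmcv-full, mmcv, mmdet) are assumptions about how these packages are named in a typical pip environment, and the function name installed_versions is hypothetical.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchvision", "mmcv-full", "mmcv", "mmdet")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not present in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Posting this output together with the CUDA version makes it easier to compare a working setup with a failing one.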

Mugutech62 commented 9 months ago

@mathmanu Has the issue been resolved?

Mugutech62 commented 8 months ago

Have you solved it?

vladbph commented 8 months ago

Same issue here, 100% reproducible. When the GPU is enabled (the number of enabled GPUs is irrelevant), a DataContainer object is added as a wrapper over the tensor (at least for the validation set in my testing). So unless some config setting is missing, there is a bigger issue here. Could someone check, please? So far it is not possible to train the model on a GPU because of this.
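
To illustrate the failure mode, here is a simplified stand-in for a container that only forwards size() when it holds a tensor. It is not the mmcv class: FakeTensor, the guard decorator, and the unwrap helper are all hypothetical, written only so the sketch runs without PyTorch or mmcv. It shows how a wrapper that still holds a Python list when size() is called would produce the same message as in the original report.

```python
import functools


class FakeTensor:
    """Stand-in for a tensor type, so the sketch has no dependencies."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    def size(self):
        return self.shape


def assert_tensor_type(func):
    """Allow the wrapped method only when the container holds a tensor."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if not isinstance(self.data, FakeTensor):
            raise AttributeError(
                f"{type(self).__name__} has no attribute {func.__name__} "
                f"for type {type(self.data)}")
        return func(self, *args, **kwargs)
    return wrapper


class DataContainer:
    """Hypothetical minimal wrapper around a payload (not the mmcv class)."""

    def __init__(self, data):
        self.data = data

    @assert_tensor_type
    def size(self, *args, **kwargs):
        return self.data.size(*args, **kwargs)


def unwrap(batch):
    """Return the payload if the batch is still wrapped, else the batch."""
    return batch.data if isinstance(batch, DataContainer) else batch


# A wrapped tensor forwards size() without complaint.
print(DataContainer(FakeTensor((8, 3, 416, 416))).size())

# A wrapped list reaches size() and fails with the message from the report.
try:
    DataContainer([FakeTensor((8, 3, 416, 416))]).size()
except AttributeError as err:
    print(err)
```

If something like this is what happens in the real pipeline, the container is reaching model code without being unwrapped first, which would fit the observation that only the GPU path is affected.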

YGZone commented 7 months ago

To this day (20230301), I still use the CPU for training. My model is small, so it trains quickly on the CPU.

Mugutech62 commented 6 months ago

You can comment out the CUDA 11.8 and mmcv installation lines in setup.py, then try again with your native, latest CUDA.