TexasInstruments / edgeai-benchmark

This repository has been moved. The new location is in https://github.com/TexasInstruments/edgeai-tensorlab
https://github.com/TexasInstruments/edgeai

YOLOv5 Ti Lite Custom Model Compilation Process #9

Open dpetersonVT23 opened 2 years ago

dpetersonVT23 commented 2 years ago

Is this the correct repository to compile and deploy a custom trained YOLOv5 model from the YOLOv5 Ti repository (https://github.com/TexasInstruments/edgeai-yolov5)?

I am having trouble figuring out where to start in this repo, i.e., where to put the trained weights and begin compilation. I have run the setup script and already trained my custom model using the edgeai-yolov5 repository.

Should I benchmark or compile first? What are the steps to do so successfully? Any guidance from this point is appreciated.

mathmanu commented 2 years ago

You can use this script to compile your custom model: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_custom_pc.sh

Several models are listed in that file for convenience. You can comment out all the models except yolov5, since yours is specifically about yolov5.

You can also look at this tutorial to understand single model compilation: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_tutorials_pc.sh

As you know, our default benchmark script that compiles all models is: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_benchmarks_pc.sh But you can also run only one specific model by selecting that model's id in the settings yaml file: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/settings_base.yaml#L65

dpetersonVT23 commented 2 years ago

Thank you for the quick response @mathmanu, will look more closely at these files and let you know if I have any questions!

dpetersonVT23 commented 2 years ago

@mathmanu I have commented out everything except YOLOv5 and replaced the paths to the .onnx and .prototxt files with those of my custom model. Is there something else I have to do regarding datasets? I am getting errors running the run_custom_pc.sh script. Thanks again!

mathmanu commented 2 years ago

Comment out these datasets that are not needed in your case: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L151

Also set the dataset path appropriately: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L105 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L112

That should be enough.

dpetersonVT23 commented 2 years ago

@mathmanu I am confused about why the compilation needs access to the dataset. Can you help me understand that?

I have commented out the cls and seg datasets. For the get_imagedet_dataset_loaders function, I have my dataset set up as required for YOLO training (images and labels directories, each with train, test, and val subdirectories). What format needs to be returned from this function? The code currently in it seems specific to COCO, so I'm not sure what to replace it with.

mathmanu commented 2 years ago

Compilation requires a set of images. edgeai-benchmark compilation works with datasets and it can also generate accuracy. You can provide your own dataset there instead of COCO, but the data loader there understands the COCO format.

If you are looking for a simple script that does compilation only with a few images, you can use our low level tidl tools repository: https://github.com/TexasInstruments/edgeai-tidl-tools
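Since the data loader expects COCO format while the dataset in question is laid out for YOLO training, a conversion step is needed. The following is a minimal sketch of such a conversion, not a script from any TI repository. It assumes the usual YOLO label layout (one `class x_center y_center width height` line per object, normalized to the image size) and 1-based COCO category ids; the function names and the `image_sizes` argument are hypothetical, and image sizes are passed in so that no image library is required.

```python
import json
from pathlib import Path


def yolo_line_to_coco_bbox(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, [x_min, y_min, w, h]) in pixels."""
    cls, xc, yc, w, h = line.split()[:5]
    w_px = float(w) * img_w
    h_px = float(h) * img_h
    x_min = float(xc) * img_w - w_px / 2
    y_min = float(yc) * img_h - h_px / 2
    return int(cls), [x_min, y_min, w_px, h_px]


def yolo_to_coco(labels_dir, image_sizes, class_names):
    """Build a COCO-style dict from a directory of YOLO .txt label files.

    image_sizes maps an image file name to (width, height).
    """
    coco = {
        "images": [],
        "annotations": [],
        # Assumption: category ids start at 1, YOLO class ids start at 0.
        "categories": [{"id": i + 1, "name": n} for i, n in enumerate(class_names)],
    }
    ann_id = 1
    for img_id, (file_name, (img_w, img_h)) in enumerate(sorted(image_sizes.items()), start=1):
        coco["images"].append(
            {"id": img_id, "file_name": file_name, "width": img_w, "height": img_h}
        )
        label_file = Path(labels_dir) / (Path(file_name).stem + ".txt")
        if not label_file.exists():
            continue  # image without labelled objects
        for line in label_file.read_text().splitlines():
            if not line.strip():
                continue
            cls, bbox = yolo_line_to_coco_bbox(line, img_w, img_h)
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cls + 1,
                "bbox": bbox,
                "area": bbox[2] * bbox[3],
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


def save_coco(coco, out_path):
    """Write the COCO dict to a JSON file."""
    with open(out_path, "w") as f:
        json.dump(coco, f)
```

A call such as `save_coco(yolo_to_coco("labels/val", sizes, ["my_class"]), "instances_val.json")` would produce one annotation file per split; the directory and file names here are placeholders.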

dpetersonVT23 commented 2 years ago

@mathmanu My goal is to prepare this model for deployment on the Beaglebone AI and have it compiled such that it will take advantage of the hardware accelerators. Is the easiest way to do this through the repo you linked or through the benchmark_custom script?

I am more than fine to skip the actual benchmarking for now, my immediate goal is to achieve compilation and a method of deployment for this board. Thanks for your timely help!

dpetersonVT23 commented 2 years ago

@mathmanu I continued working with tutorial_detection.ipynb, which follows the same idea as the benchmark_custom.py script. When I run the cell with tools.run_accuracy, the execution of the cell hangs and does not complete; it remains at 0% task completion. The model path, model file, and pipeline config all print out, but nothing else after that.

When I use the same contents in the benchmark_custom.py script it has this output on the terminal (running run_custom_pc.sh), let me know your insight on this error output:

    Final number of subgraphs created are : 1, - Offloaded Nodes - 242, Total Nodes - 242
    2022-08-11 13:47:30.858408637 [E:onnxruntime:, inference_session.cc:1311 operator()] Exception during initialization: /home/a0230315/workarea/onnxrt/onnxruntime/include/onnxruntime/core/graph/graph.h:1300 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index. Got:65 Max:1

    [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /home/a0230315/workarea/onnxrt/onnxruntime/include/onnxruntime/core/graph/graph.h:1300 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index. Got:65 Max:1

    Traceback (most recent call last):
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/pipeline_runner.py", line 135, in _run_pipeline
        accuracy_result = accuracy_pipeline(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 104, in __call__
        param_result = self._run(description=description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 146, in _run
        output_list = self._infer_frames(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 194, in _infer_frames
        is_ok = session.start_infer()
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 80, in start_infer
        self.interpreter = self._create_interpreter(is_import=False)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 132, in _create_interpreter
        provider_options=[runtime_options, {}], sess_options=sess_options)
      File "/home/mm282681/miniconda3/envs/benchmark/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
        self._create_inference_session(providers, provider_options)
      File "/home/mm282681/miniconda3/envs/benchmark/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 315, in _create_inference_session
        sess.initialize_session(providers, provider_options)
    onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /home/a0230315/workarea/onnxrt/onnxruntime/include/onnxruntime/core/graph/graph.h:1300 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index. Got:65 Max:1

    TASKS | 100%|██████████||

mathmanu commented 2 years ago

Is it possible to share your .onnx and .prototxt files so that we can take a look?

CC: @debapriyamaji

dpetersonVT23 commented 2 years ago

Sure, I have attached them in a zip. The only modification I have made after exporting to .onnx from .pt is changing the confidence_threshold in the .prototxt from 0.005 to 0.3.

Please let me know if there is anything else I can provide, thanks. @mathmanu @debapriyamaji test_640s_ti_lite.zip

To note: I saw this morning that I do have an artifacts folder with some .txts and a subdirectory with some .bins after running this script even though I still get the above error. Is there any code where I can test the compiled weights (artifacts I assume is an equivalent term) to confirm if they compiled correctly and work as expected when running inference?

mathmanu commented 2 years ago

While we are waiting for @debapriyamaji to take a look at what you shared, you can try this: The compiled artifact is supposed to work in the EVM using the EdgeAI SDK: https://www.ti.com/tool/download/PROCESSOR-SDK-LINUX-SK-TDA4VM You can package the artifact by running ./run_package_artifact.sh and try to use it in the EVM.

dpetersonVT23 commented 2 years ago

Sounds good, thanks @mathmanu. The artifacts were packaged successfully with that script. I will see what I can do in the EVM with the EdgeAI SDK in the meantime.

dpetersonVT23 commented 2 years ago

Has @debapriyamaji had a chance to review the files? I am still working on the integration test on the board.

mathmanu commented 2 years ago

Make sure that you are using this configuration and that input_optimization is set to False: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L260

Several customers have questions about yolov5, so @debapriyamaji is integrating yolov5 into our https://github.com/TexasInstruments/edgeai-modelmaker We hope to release the update in a couple of days. Then the only thing you will need to provide is your dataset in COCO format, and everything else, including compilation, will be taken care of by this tool.

Also take a look at a similar thread that reported issues and it seems to be resolved by using the example in benchmark_custom.py: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1121823/tda4vm-edgeai-benchmark-yolov5-model-compilation-error/4164033#4164033

dpetersonVT23 commented 2 years ago

@mathmanu

Confirmed input_optimization is set to False and the rest of the configuration matches, aside from the paths and the output feature 16-bit names list:

    'imagedet-7': dict(
        task_type='detection',
        calibration_dataset=imagedet_calib_dataset,
        input_dataset=imagedet_val_dataset,
        preprocess=preproc_transforms.get_transform_onnx(640, 640,  resize_with_pad=True, backend='cv2', pad_color=[114,114,114]),
        session=sessions.ONNXRTSession(**utils.dict_update(onnx_session_cfg, input_optimization=False, input_mean=(0.0, 0.0, 0.0), input_scale=(0.003921568627, 0.003921568627, 0.003921568627)),
            runtime_options=utils.dict_update(settings.runtime_options_onnx_np2(),
                                {'object_detection:meta_arch_type': 6,
                                 'object_detection:meta_layers_names_list':f'../edgeai-yolov5/weights/test_640s_ti_lite/test_640s_ti_lite.prototxt',
                                 'advanced_options:output_feature_16bit_names_list':'onnx::Reshape_291, onnx::Reshape_347, onnx::Reshape_403'
                                 }),
            model_path=f'../edgeai-yolov5/weights/test_640s_ti_lite/test_640s_ti_lite.onnx'),
        postprocess=postproc_transforms.get_transform_detection_yolov5_onnx(squeeze_axis=None, normalized_detections=False, resize_with_pad=True, formatter=postprocess.DetectionBoxSL2BoxLS()), 

        metric=dict(label_offset_pred=datasets.coco_det_label_offset_80to90(label_offset=1)),
        model_info=dict(metric_reference={'accuracy_ap[.5:.95]%':37.4})
    ),

Sounds good, hopefully that removes any bugs in the process of compiling a model from the edgeai-yolov5 repository.

I reviewed the linked thread, the only difference I see, and maybe you noticed this as well if you compared the .prototxts, is that my .prototxt has only 3 yoloparam blocks/layers whereas the one in this thread and others I have seen have 4 when training on YOLOv5s6 from Ti. Additionally, the "input" attributes are mapped to integer values, whereas mine contains an "onnx::Reshape" prefix. I did not deviate or do any significant customizations in the training process, so I'm not sure why these differences are appearing.
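To compare two .prototxt files on exactly these points (number of blocks and the value of each "input" attribute), a small text scan is enough. The sketch below assumes the file contains flat, non-nested `yolo_param { ... }` blocks, each with an `input: "..."` field; that layout is inferred from the description above rather than from a format specification, and `yolo_param_inputs` is a hypothetical helper name.

```python
import re


def yolo_param_inputs(prototxt_text):
    """Return the `input` value of every yolo_param block, in file order."""
    inputs = []
    # Assumes yolo_param blocks contain no nested braces, so a non-greedy
    # match up to the first closing brace captures one whole block.
    for block in re.findall(r"yolo_param\s*\{(.*?)\}", prototxt_text, flags=re.DOTALL):
        match = re.search(r'input:\s*"([^"]*)"', block)
        inputs.append(match.group(1) if match else None)
    return inputs


if __name__ == "__main__":
    sample = '''
    tidl_yolo {
      yolo_param {
        input: "onnx::Reshape_291"
      }
      yolo_param {
        input: "347"
      }
    }
    '''
    names = yolo_param_inputs(sample)
    print(len(names), "yolo_param blocks:", names)
```

Running it on both the custom file and a known-good one makes the differences in block count and naming visible at a glance.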

mathmanu commented 1 year ago

Hi, can you try removing your onnx package and installing version 1.8.1? Then export the onnx model once again and try. The reason I am asking is that we just integrated edgeai-yolov5 into edgeai-modelmaker and it worked without issue.

Or you can wait for a day and we shall update edgeai-modelmaker tomorrow with yolov5 support.

dpetersonVT23 commented 1 year ago

I'm assuming you are referring to the export process from .pt to .onnx and .prototxt in the edgeai-yolov5 repository. I will attempt to downgrade the onnx package from 1.11.0 to 1.8.1 and export, but the requirements.txt in this repository says >=1.9.0 for the onnx package. My separate conda environment for the edgeai-benchmark repository already had 1.8.1 installed. If this does not solve the issue, I will wait for your release in the near future and go from there. Thank you!

dpetersonVT23 commented 1 year ago

@mathmanu I received this error when trying to export to .onnx and .prototxt in edgeai-yolov5 repo using onnx==1.8.1:

ImportError: /home/user/miniconda3/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /lib/x86_64-linux-gnu/libprotobuf.so.23)

Followed suggested fixes online and nothing worked. Script worked once I ran with 1.9.0 or above, consistent with the requirements. Let me know if you were referring to a different script for exporting to onnx model.

mathmanu commented 1 year ago

Can you try to use a lower Python version? (You can create an environment in miniconda.) Try Python 3.6.

dpetersonVT23 commented 1 year ago

That export worked. The .prototxt no longer has the "onnx::Reshape_" prefixes I mentioned. However, there are still only 3 yolo_param blocks, but this may be normal since I am training a smaller model with only 1 class. When I run the custom benchmark script to compile and get the artifacts, I get an error about the Provider Type. I am not sure why I do not have the TIDL provider when I do have the CPU one. Let me know your thoughts.

    6: UserWarning: Specified provider 'TIDLExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
      "Available providers: '{}'".format(name, ", ".join(available_provider_names)))

    Unknown Provider Type: TIDLExecutionProvider
    Traceback (most recent call last):
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/pipeline_runner.py", line 135, in _run_pipeline
        accuracy_result = accuracy_pipeline(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 104, in __call__
        param_result = self._run(description=description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 146, in _run
        output_list = self._infer_frames(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 194, in _infer_frames
        is_ok = session.start_infer()
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 80, in start_infer
        self.interpreter = self._create_interpreter(is_import=False)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 132, in _create_interpreter
        provider_options=[runtime_options, {}], sess_options=sess_options)
      File "/home/mm282681/.local/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
        self._create_inference_session(providers, provider_options, disabled_optimizers)
      File "/home/mm282681/.local/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
        sess.initialize_session(providers, provider_options, disabled_optimizers)
    RuntimeError: Unknown Provider Type: TIDLExecutionProvider

mathmanu commented 1 year ago

This is good progress. We need to specify Python and onnx versions in our requirements. CC: @debapriyamaji

RuntimeError: Unknown Provider Type: TIDLExecutionProvider

The error may mean that the correct onnxruntime for TIDL is not installed: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup.sh#L69

Or it may mean that the tidl_tools folder is not found: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_setup_env.sh#L52 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_custom_pc.sh#L36 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup.sh#L62 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup.sh#L71
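Both possibilities can be checked from the same Python environment that runs the benchmark. The sketch below uses `onnxruntime.get_available_providers()`, which is part of the standard onnxruntime Python API. The environment variable name `TIDL_TOOLS_PATH` is an assumption; check run_setup_env.sh for the name it actually exports. `diagnose_tidl` is a hypothetical helper, not part of edgeai-benchmark.

```python
import os


def diagnose_tidl(available_providers, tidl_tools_path):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if "TIDLExecutionProvider" not in available_providers:
        problems.append("installed onnxruntime does not offer TIDLExecutionProvider")
    if not tidl_tools_path or not os.path.isdir(tidl_tools_path):
        problems.append("tidl_tools folder not found: %r" % (tidl_tools_path,))
    return problems


if __name__ == "__main__":
    try:
        import onnxruntime
        providers = onnxruntime.get_available_providers()
        print("onnxruntime loaded from:", onnxruntime.__file__)
    except ImportError:
        providers = []
    # Assumption: the setup scripts export the tools location under this name.
    tools_path = os.environ.get("TIDL_TOOLS_PATH")
    for problem in diagnose_tidl(providers, tools_path):
        print("PROBLEM:", problem)
```

Printing `onnxruntime.__file__` also shows which installation is being imported, which can matter when a stock onnxruntime in a user site-packages directory shadows the TIDL build.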

dpetersonVT23 commented 1 year ago

Awesome to hear that has been figured out.

I'll rerun the setup script and then run the custom benchmark script again and let you know if that fixes the issue.

dpetersonVT23 commented 1 year ago

@mathmanu Ran the setup script again, same error occurred when running run_custom_pc.sh. Manually removed the tidl_tools directory and the tidl_tools.tar.gz file. After running setup.sh, I manually ran line 69 and 71 from setup.sh to confirm both were installed correctly as well.

Running run_setup_env with pc argument prints the correct path for tidl_tools folder.

If it helps at all, I am working on a Linux machine running Ubuntu 22.04, but I do not think this is a contributing factor to the error.

tidl_tools folder has the following contents:

    ├── device_config.cfg
    ├── gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu
    ├── itidl_rt.h
    ├── libtidl_onnxrt_EP.so
    ├── libtidl_tfl_delegate.so
    ├── libvx_tidl_rt.so
    ├── libvx_tidl_rt.so.1.0
    ├── PC_dsp_test_dl_algo.out
    ├── ti_cnnperfsim.out
    ├── tidl_graphVisualiser.out
    ├── tidl_graphVisualiser_runtimes.out
    ├── tidl_model_import_onnx.so
    ├── tidl_model_import_relay.so
    └── tidl_model_import_tflite.so

mathmanu commented 1 year ago

Yolov5 support has now been added in edgeai-modelmaker: https://github.com/TexasInstruments/edgeai-modelmaker You can change the model to be trained in the config file: https://github.com/TexasInstruments/edgeai-modelmaker/blob/master/config_detection.yaml#L44

We still need to enhance the yolov5 support - for example, changing the learning rate is not yet enabled - but wanted to release a version quickly as you have been waiting. Hopefully we can do the pending things and push a complete version tomorrow.

Please try and let us know. Be sure to create a fresh Python 3.6 environment for this.

dpetersonVT23 commented 1 year ago

I encountered an error when trying to run the detection example.

I have CUDA 11.3. After running setup_all.sh and running the detection example I encountered this error:

RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.

PyTorch seems to be right, but I am not sure why torchvision was installed with CUDA 11.7, seems to be related to edgeai-torchvision. I removed the edgeai-torchvision directory and pip installed the correct torchvision for CUDA 11.3. The script was then missing modules from edgeai-torchvision. Will have to spend time figuring out a workaround for this.

In the meantime, working on converting VoTT export to COCO format.

mathmanu commented 1 year ago

If you have multiple CUDA versions (for example both CUDA 11.3 and CUDA 11.7) installed, then it is possible that the setup of edgeai-torchvision picks up the wrong CUDA version. This can be corrected by setting LD_LIBRARY_PATH to the correct CUDA version.
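Entries in LD_LIBRARY_PATH are searched in order, so the first CUDA directory listed is the one whose libraries are found first. The following sketch simply lists the CUDA-looking entries in that order. It treats any entry containing "cuda" in its path as a CUDA directory, which is a naming assumption, and `cuda_dirs_in_search_order` is a hypothetical helper.

```python
import os
import re


def cuda_dirs_in_search_order(ld_library_path):
    """Return the LD_LIBRARY_PATH entries that look like CUDA directories, in search order."""
    entries = [p for p in ld_library_path.split(os.pathsep) if p]
    # Assumption: CUDA library directories have "cuda" somewhere in their path.
    return [p for p in entries if re.search(r"cuda", p, flags=re.IGNORECASE)]


if __name__ == "__main__":
    found = cuda_dirs_in_search_order(os.environ.get("LD_LIBRARY_PATH", ""))
    if not found:
        print("No CUDA directory in LD_LIBRARY_PATH")
    for position, path in enumerate(found, start=1):
        print(position, path)
```

If the first entry printed belongs to a different CUDA version than the one PyTorch was built for, reordering LD_LIBRARY_PATH before running the setup script is the change suggested above.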

mathmanu commented 1 year ago

Also, we have been using a gcc version of 7.x (specifically 7.5). We have noticed issues when installing edgeai-torchvision when gcc version is 5.x.

You are using Ubuntu 22.04, which could have a different gcc version. If the above doesn't solve the issue, you can try again after installing gcc-7 and g++-7.

It is easy to have multiple gcc versions and switch between them using update-alternatives.

mathmanu commented 1 year ago

If you still have issues, you can use the docker build scripts that we have given here (https://github.com/TexasInstruments/edgeai-modelmaker) to bring up a docker container and use modelmaker inside the container.

dpetersonVT23 commented 1 year ago

Used the docker build scripts, ran the setup_all script, and ran the detection and classification examples. I received this error, seems to be regarding downloading the dataset. I received the same error running both examples.

    argv: ['./scripts/run_modelmaker.py', 'config_detection.yaml']
    Model:yolox_s_lite_mmdet TargetDevice:TDA4VM FPS(Estimate):107
    downloading from http://software-dl.ti.com/jacinto7/esd/modelzoo/latest/datasets/tiscapes2017_driving.zip to /home/edgeai/code/edgeai-modelmaker/data/projects/tiscapes2017_driving/other/download/tiscapes2017_driving.zip
    HTTP Error 403: Forbidden
    Traceback (most recent call last):
      File "./scripts/run_modelmaker.py", line 127, in <module>
        main(config)
      File "./scripts/run_modelmaker.py", line 66, in main
        run_params_file = model_runner.prepare()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/runner.py", line 96, in prepare
        self.dataset_handling.run()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/datasets/__init__.py", line 121, in run
        self.params.dataset.extract_path)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/utils/download_utils.py", line 183, in download_file
        progressbar_creator=progressbar_creator)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/utils/download_utils.py", line 138, in download_and_extract
        extract_success = extract_files(dataset_url, extract_root)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/utils/download_utils.py", line 70, in extract_files
        if download_file.endswith('.tar'):
    AttributeError: 'NoneType' object has no attribute 'endswith'

mathmanu commented 1 year ago

HTTP Error 403: Forbidden

Network issue?

dpetersonVT23 commented 1 year ago

Ran speed test, no network issues.

Request was received by server but I was denied access to the file.

Will work on converting data labels to COCO JSON and starting training.
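The distinction drawn here, between a server that answers with an error status and a connection that never gets through, can be checked directly from Python's standard library. This is a minimal sketch and not part of edgeai-modelmaker; `probe_url` is a hypothetical helper, and a server may treat the HEAD request it sends differently from the GET used by the real download.

```python
import urllib.error
import urllib.request


def probe_url(url, opener=urllib.request.urlopen, timeout=10):
    """Classify a URL as 'ok', 'http <code>' (server replied with an error) or 'network'."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as err:
        # The server was reached and replied with an error status such as 403.
        return "http %d" % err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return "network"


if __name__ == "__main__":
    url = "http://software-dl.ti.com/jacinto7/esd/modelzoo/latest/datasets/tiscapes2017_driving.zip"
    print(probe_url(url))
```

`HTTPError` is caught before `URLError` because it is a subclass of it; with the order reversed, a 403 would be reported as a network problem.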

dpetersonVT23 commented 1 year ago

I am attempting to train with custom data. I have converted everything to COCO JSON format and modified the config file appropriately with the correct paths. When I run the training script with the custom config, I get this error about the target device.

    argv: ['./scripts/run_modelmaker.py', 'config_detection.yaml']
    Traceback (most recent call last):
      File "./scripts/run_modelmaker.py", line 127, in <module>
        main(config)
      File "./scripts/run_modelmaker.py", line 66, in main
        run_params_file = model_runner.prepare()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/runner.py", line 106, in prepare
        self.model_compilation = compilation.edgeai_benchmark.ModelCompilation(self.params)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/compilation/edgeai_benchmark.py", line 51, in __init__
        self.settings_file = jai_benchmark.get_settings_file(target_machine=self.params.common.target_machine, with_model_import=True)
      File "/home/edgeai/code/edgeai-benchmark/jai_benchmark/__init__.py", line 38, in get_settings_file
        assert target_machine in supported_machines, f'target_machine must be one of {supported_machines}'
    AssertionError: target_machine must be one of ('pc', 'j7')

mathmanu commented 1 year ago

edgeai-benchmark needed to be updated on GitHub. I have done that just now. Can you pull it and try again?

dpetersonVT23 commented 1 year ago

Got a bunch of errors regarding divergent branches when trying to pull benchmark individually. Just going to copy settings_base.yaml

mathmanu commented 1 year ago

git rebase may not have worked. What you suggested may be the easiest option.

mathmanu commented 1 year ago

Just going to copy settings_base.yaml

Why is this copy required?

dpetersonVT23 commented 1 year ago

I noticed that the only thing updated in the benchmark commit was that file, so I edited my comment. You think I should remove everything and run setup all script again?

mathmanu commented 1 year ago

You think I should remove everything and run setup all script again?

That may be the simplest option

mathmanu commented 1 year ago

Git may have some options to pull without trying to merge or rebase with your local version. That would work.

Or you can just delete edgeai-benchmark, clone it, and run its setup.sh

mathmanu commented 1 year ago

The change is not just in settings_base.yaml - the history of the branch has changed. That is why you got the git conflicts.

dpetersonVT23 commented 1 year ago

Got it, rerunning setup scripts.

dpetersonVT23 commented 1 year ago

Got the training started, only works on CPU. I followed the steps for GPU from docker container and the verification of the GPU presence works when I run

sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

but get a CUDA unavailable error in training. My guess is it relates to different versions of CUDA for torch and torchvision. Only training for 50 epochs, so we will see what happens, should take about 30-60 mins. Should I expect it to package the artifacts automatically?

    Traceback (most recent call last):
      File "./scripts/run_modelmaker.py", line 127, in <module>
        main(config)
      File "./scripts/run_modelmaker.py", line 70, in main
        model_runner.run()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/runner.py", line 118, in run
        self.model_training.run()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/training/edgeai_yolov5/detection.py", line 300, in run
        hyp=args_yolo['hyp'], project=args_yolo['project'], name='')
      File "/home/edgeai/code/edgeai-yolov5/train.py", line 581, in run
        main(opt)
      File "/home/edgeai/code/edgeai-yolov5/train.py", line 471, in main
        device = select_device(opt.device, batch_size=opt.batch_size)
      File "/home/edgeai/code/edgeai-yolov5/utils/torch_utils.py", line 73, in select_device
        assert torch.cuda.is_available(), f'CUDA unavailable, invalid device {device} requested'  # check availability
    AssertionError: CUDA unavailable, invalid device 0 requested

mathmanu commented 1 year ago

I am not really an expert in using CUDA within docker. But I found related documentation here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

You can consult someone who is knowledgeable in this field.
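The assertion in the traceback comes from a `torch.cuda.is_available()` check, so the same check can be run inside the modelmaker container before starting a long training job. The sketch below is only a diagnostic under that assumption; `choose_device` is a hypothetical helper, and the device strings "0" and "cpu" follow the values seen in the error message rather than any documented modelmaker option.

```python
def choose_device(requested, cuda_available):
    """Return `requested` if it is usable, otherwise fall back to 'cpu'."""
    if requested != "cpu" and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    try:
        import torch
        cuda_ok = torch.cuda.is_available()
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
    except ImportError:
        cuda_ok = False
        print("torch is not installed in this environment")
    print("CUDA available:", cuda_ok)
    print("device to use:", choose_device("0", cuda_ok))
```

Note that a successful `nvidia-smi` run in the `nvidia/cuda` image only shows that Docker can expose the GPU to that image; the check has to pass inside the container that actually runs the training.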

mathmanu commented 1 year ago

Note that the edgeai-yolov5 integration into edgeai-modelmaker is not complete (especially hyperparameters), so the accuracy that you get may be less than optimal, especially since our default number of epochs is only 30.

We hope to fix this issue in one or two days.

dpetersonVT23 commented 1 year ago

Sounds good, I will let it train on CPU and see if it compiles the model correctly, if it does I will spend more time getting GPU to work in docker. No worries on the hyperparameters right now, just want an MVP on BBAI.

dpetersonVT23 commented 1 year ago

After it finishes training, which scripts would you like me to use for export to .onnx and .prototxt? And then from there to the compiled artifacts?

mathmanu commented 1 year ago

I will spend more time getting GPU to work in docker

If you find the right solution, please write it here, so that I can try it out and also document it in this repository.

dpetersonVT23 commented 1 year ago

I will spend more time getting GPU to work in docker

If you find the right solution, please write it here, so that I can try it out and also document it in this repository.

Will do for sure.

mathmanu commented 1 year ago

After it finishes training, which scripts would you like me to use for export to .onnx and .prototxt? And then from there to the compiled artifacts?

This does both training and compilation - nothing else needs to be done. After this finishes running, you will get a .tar.gz file that you can directly use in PROCESSOR-SDK-LINUX-SK-TDA4VM (a.k.a. EdgeAI SDK) on the EVM https://www.ti.com/tool/PROCESSOR-SDK-J721E
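Before copying the resulting .tar.gz to the board, its contents can be listed with Python's standard `tarfile` module to confirm that the package is readable and not empty. This is a minimal sketch; `list_package` is a hypothetical helper and the package path is a placeholder, since the actual output location is not stated here.

```python
import sys
import tarfile


def list_package(tar_path, limit=20):
    """Return up to `limit` member names from a .tar.gz package."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return [member.name for member in tar.getmembers()[:limit]]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_package.py <path to the generated .tar.gz>
    for name in list_package(sys.argv[1]):
        print(name)
```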

dpetersonVT23 commented 1 year ago

Sounds great, hopefully that will work smoothly for deployment on the BBAI board.

Will let you know when training finishes.