aws-samples / amazon-sagemaker-tensorflow-object-detection-api

Train and deploy models using TensorFlow 2 with the Object Detection API on Amazon SageMaker
MIT No Attribution

Training job failed #28

Closed QuanNguyenAUT closed 2 years ago

QuanNguyenAUT commented 2 years ago

Hi there, I am facing an issue during the training job. I tried some solutions, such as downgrading the OpenCV version, but unfortunately they did not work.

===TRAINING THE MODEL==
Traceback (most recent call last):
  File "model_main_tf2.py", line 31, in <module>
    from object_detection import model_lib_v2
  File "/usr/local/lib/python3.8/dist-packages/object_detection/model_lib_v2.py", line 29, in <module>
    from object_detection import eval_util
  File "/usr/local/lib/python3.8/dist-packages/object_detection/eval_util.py", line 36, in <module>
    from object_detection.metrics import lvis_evaluation
  File "/usr/local/lib/python3.8/dist-packages/object_detection/metrics/lvis_evaluation.py", line 23, in <module>
    from lvis import results as lvis_results
  File "/usr/local/lib/python3.8/dist-packages/lvis/__init__.py", line 5, in <module>
    from lvis.vis import LVISVis
  File "/usr/local/lib/python3.8/dist-packages/lvis/vis.py", line 1, in <module>
    import cv2
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 175, in bootstrap
    if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
    py_module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.8/dist-packages/cv2/gapi/__init__.py", line 290, in <module>
    cv.gapi.wip.GStreamerPipeline = cv.gapi_wip_gst_GStreamerPipeline
AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline' (most likely due to a circular import)

The job then ends with this final message:

UnexpectedStatusException: Error for Training job tf2-object-detection-2022-10-13-10-21-03-764: Failed. Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "" Command "/bin/sh -c ./run_training.sh --model_dir /opt/training --num_train_steps 100 --pipeline_config_path pipeline.config --sample_1_of_n_eval_examples 1", exit code: 1

I have searched for this error several times but can't find anything. Could you help me figure out this issue? Many thanks.

Othmane796 commented 2 years ago

Hi @QuanNguyenAUT, looking into this.

Othmane796 commented 2 years ago

Hi @QuanNguyenAUT. The issue was due to OpenCV, as you guessed above. It is installed automatically when we install the TensorFlow Object Detection API in the Docker image. Downgrading only "opencv-python" was not enough: I had to uninstall both "opencv-python" and "opencv-python-headless" and then reinstall the downgraded version 4.5.2.52 (the same for both).
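For anyone who wants to reproduce the fix by hand, here is a minimal sketch of the uninstall-then-reinstall step written as a small Python helper. It is not the actual Dockerfile change from the repo; the function names `build_pip_commands` and `reinstall_opencv` are hypothetical, and only the two package names and the version 4.5.2.52 come from the comment above.

```python
import subprocess
import sys

# Package names and version are taken from the comment above.
# The helper names below are hypothetical, chosen for illustration.
OPENCV_PACKAGES = ("opencv-python", "opencv-python-headless")
PINNED_VERSION = "4.5.2.52"


def build_pip_commands(packages=OPENCV_PACKAGES, version=PINNED_VERSION,
                       python=sys.executable):
    """Return the uninstall and install commands as argument lists."""
    # Remove every OpenCV wheel first so no mixed files are left behind.
    uninstall = [python, "-m", "pip", "uninstall", "-y", *packages]
    # Reinstall all of them pinned to the same version.
    install = [python, "-m", "pip", "install",
               *(f"{name}=={version}" for name in packages)]
    return [uninstall, install]


def reinstall_opencv():
    """Run both commands in order, stopping at the first failure."""
    for command in build_pip_commands():
        subprocess.run(command, check=True)
```

Calling `reinstall_opencv()` inside the training image, after the Object Detection API has been installed, would run both commands. The order matters in this sketch: both packages are removed before either one is installed again, which matches the "uninstall, then re-install" description above.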

I updated the Dockerfile to reflect this and tested a training job.

Please pull the repo or see changes in latest commit and let me know if you're still facing issues.
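If the error persists, one way to check what actually ended up in the image is to list the installed OpenCV distributions. The sketch below is only an assumption about how you might verify it, using the standard-library `importlib.metadata` module (available on the Python 3.8 shown in the traceback); the helper names `installed_versions` and `mismatches` are hypothetical.

```python
from importlib import metadata

# Package names and expected version come from the fix described above.
OPENCV_PACKAGES = ("opencv-python", "opencv-python-headless")
EXPECTED_VERSION = "4.5.2.52"


def installed_versions(packages=OPENCV_PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(versions, expected=EXPECTED_VERSION):
    """Return only the packages whose version differs from the expected one."""
    return {name: found for name, found in versions.items()
            if found != expected}
```

Running `mismatches(installed_versions())` in the container should return an empty dict when both packages are pinned to 4.5.2.52. A missing package shows up with the value `None`.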

Cheers,

QuanNguyenAUT commented 2 years ago

Hi @Othmane796, many thanks. It saved my life. The training job now runs smoothly. I appreciate your help.

Cheers,