GoogleCloudPlatform / vertex-ai-samples

Notebooks, code samples, sample apps, and other resources that demonstrate how to use, develop and manage machine learning and generative AI workflows using Google Cloud Vertex AI.
https://cloud.google.com/vertex-ai
Apache License 2.0

Detectron2 keypoints Rcnn Training using VertexAI custom Training Job #2629

Closed RiccardoMaistri closed 8 months ago

RiccardoMaistri commented 8 months ago

I am having an issue starting a Keypoint R-CNN training run with the detectron2 framework, using a Vertex AI custom training job.

I forked the detectron2-train-docker-image and added support for Keypoint R-CNN. The changes touch a few files and add the keypoint-related detectron2 cfg options.

What puzzles me is that when I run the code locally, everything works fine. The dataset contains two images with three keypoints each. The added cfg options are simply:

["MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS"] + ["3"]

["TEST.KEYPOINT_OKS_SIGMAS"] + [str(sigmas)]

plus keypoint_names and keypoint_flip_map in the dataset Metadata.
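A minimal sketch of how these overrides and metadata fields might be assembled and sanity-checked before the job starts. The helper name `build_keypoint_opts` and the keypoint names are hypothetical, not part of the forked image; the sketch assumes the flat key/value list is later passed to `cfg.merge_from_list(...)` and the metadata to `MetadataCatalog.get(<dataset name>).set(...)`.

```python
def build_keypoint_opts(keypoint_names, keypoint_flip_map, sigmas):
    """Build the cfg overrides and dataset metadata for a keypoint run.

    Hypothetical helper: it only prepares plain Python values and does
    not import detectron2.
    """
    # One OKS sigma is expected per keypoint.
    if len(sigmas) != len(keypoint_names):
        raise ValueError(
            "{} sigmas given for {} keypoints".format(len(sigmas), len(keypoint_names))
        )
    # Every name used in the flip map must be a declared keypoint.
    for left, right in keypoint_flip_map:
        if left not in keypoint_names or right not in keypoint_names:
            raise ValueError("flip pair ({}, {}) uses an unknown keypoint".format(left, right))

    # Flat key/value list, in the same shape as the overrides shown above.
    opts = (
        ["MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS"] + [str(len(keypoint_names))]
        + ["TEST.KEYPOINT_OKS_SIGMAS"] + [str(list(sigmas))]
    )
    # Fields that would be set on the dataset Metadata.
    metadata = {
        "keypoint_names": list(keypoint_names),
        "keypoint_flip_map": [tuple(pair) for pair in keypoint_flip_map],
    }
    return opts, metadata


# Illustrative values only: three keypoints, one left/right pair.
opts, metadata = build_keypoint_opts(
    ["nose", "left_eye", "right_eye"],
    [("left_eye", "right_eye")],
    [0.05, 0.05, 0.05],
)
print(opts)
print(metadata)
```

Deriving `NUM_KEYPOINTS` from the length of `keypoint_names` keeps the cfg value and the metadata from drifting apart.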

If I run it through the Docker container deployment, I get this traceback:


   File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
     "__main__", mod_spec)
   File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
     exec(code, run_globals)
   File "/home/appuser/trainer/task.py", line 295, in <module>
     args=(args,),
   File "/home/appuser/detectron2_repo/detectron2/engine/launch.py", line 82, in launch
     main_func(*args)
   File "/home/appuser/trainer/task.py", line 279, in main
     trainer.train()
   File "/home/appuser/detectron2_repo/detectron2/engine/defaults.py", line 484, in train
     super().train(self.start_iter, self.max_iter)
   File "/home/appuser/detectron2_repo/detectron2/engine/train_loop.py", line 149, in train
     self.run_step()
   File "/home/appuser/detectron2_repo/detectron2/engine/defaults.py", line 494, in run_step
     self._trainer.run_step()
   File "/home/appuser/detectron2_repo/detectron2/engine/train_loop.py", line 267, in run_step
     data = next(self._data_loader_iter)
   File "/home/appuser/detectron2_repo/detectron2/data/common.py", line 234, in __iter__
     for d in self.dataset:
   File "/home/appuser/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
     data = self._next_data()
   File "/home/appuser/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
     return self._process_data(data)
   File "/home/appuser/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
     data.reraise()
   File "/home/appuser/.local/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
     raise exception
 ValueError: Caught ValueError in DataLoader worker process 1.
 Original Traceback (most recent call last):
   File "/home/appuser/.local/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
     data = fetcher.fetch(index)
   File "/home/appuser/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
     data.append(next(self.dataset_iter))
   File "/home/appuser/detectron2_repo/detectron2/data/common.py", line 201, in __iter__
     yield self.dataset[idx]
   File "/home/appuser/detectron2_repo/detectron2/data/common.py", line 90, in __getitem__
     data = self._map_func(self._dataset[cur_idx])
   File "/home/appuser/detectron2_repo/detectron2/utils/serialize.py", line 26, in __call__
     return self._obj(*args, **kwargs)
   File "/home/appuser/detectron2_repo/detectron2/data/dataset_mapper.py", line 189, in __call__
     self._transform_annotations(dataset_dict, transforms, image_shape)
   File "/home/appuser/detectron2_repo/detectron2/data/dataset_mapper.py", line 128, in _transform_annotations
     for obj in dataset_dict.pop("annotations")
   File "/home/appuser/detectron2_repo/detectron2/data/dataset_mapper.py", line 129, in <listcomp>
     if obj.get("iscrowd", 0) == 0
   File "/home/appuser/detectron2_repo/detectron2/data/detection_utils.py", line 314, in transform_instance_annotations
     annotation["keypoints"], transforms, image_size, keypoint_hflip_indices
   File "/home/appuser/detectron2_repo/detectron2/data/detection_utils.py", line 360, in transform_keypoint_annotations
     "contains {} points!".format(len(keypoints), 
 ValueError: Keypoint data has 3 points, but metadata contains 15 points!
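
The final error compares the number of keypoints in each annotation with the number of horizontal-flip indices derived from the dataset metadata. The sketch below is a simplified stand-in for that check, not the detectron2 code itself. The function names and keypoint names are hypothetical, and it assumes annotations store keypoints as a flat list of (x, y, visibility) triplets. It could be run inside the container against the metadata actually registered there, to see which side reports 15.

```python
def hflip_indices(keypoint_names, keypoint_flip_map):
    """Map each keypoint index to the index of its horizontally flipped partner."""
    flip_map = dict(keypoint_flip_map)
    # Make the mapping symmetric: left -> right and right -> left.
    flip_map.update({right: left for left, right in list(flip_map.items())})
    flipped = [flip_map.get(name, name) for name in keypoint_names]
    return [keypoint_names.index(name) for name in flipped]


def check_keypoints(flat_keypoints, indices):
    """Raise if an annotation and the metadata disagree on the keypoint count."""
    num_points = len(flat_keypoints) // 3  # assumed (x, y, visibility) triplets
    if num_points != len(indices):
        raise ValueError(
            "Keypoint data has {} points, but metadata contains {} points!".format(
                num_points, len(indices)
            )
        )
    return num_points


# Illustrative annotation with three keypoints.
annotation_keypoints = [10, 20, 2, 30, 40, 2, 50, 60, 2]

# Metadata that matches the annotation: the check passes.
good = hflip_indices(["nose", "left_eye", "right_eye"], [("left_eye", "right_eye")])
print(good, check_keypoints(annotation_keypoints, good))

# Metadata with more names than the annotation has points: the check fails.
stale = hflip_indices(["k0", "k1", "k2", "k3", "k4"], [])
try:
    check_keypoints(annotation_keypoints, stale)
except ValueError as err:
    print(err)
```

Because the index list has one entry per name in `keypoint_names`, the "metadata contains 15 points" part of the message suggests that the metadata seen by the data loader in the container held 15 entries, whatever their origin.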


gericdong commented 8 months ago

@minwoo33park would you help take a look at this if you can? Thanks.