cj-mills / christianjmills

My personal blog
https://christianjmills.com/

posts/pytorch-train-object-detector-yolox-tutorial/ #41

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

Christian Mills - Training YOLOX Models for Real-Time Object Detection in PyTorch

Learn how to train YOLOX models for real-time object detection in PyTorch by creating a hand gesture detection model.

https://christianjmills.com/posts/pytorch-train-object-detector-yolox-tutorial/

aforadil commented 1 year ago

Hi. I am new to machine learning and am following your tutorials for a project. I just wanted to confirm that I can export the model to ONNX by following the next tutorial and then use it in Unity, either with the Barracuda package for YOLOX that you developed or by following the tutorial (Real-Time Object Detection in Unity With ONNX and DirectML Pt. 1). Am I right?

cj-mills commented 1 year ago

Hi @aforadil,

That's correct. You can export the model with the code in the follow-up tutorial. Use the specified settings for Barracuda in this section if you plan to use the model with either the ONNX-DirectML tutorial or the Barracuda YOLOX package.

You can swap the model and colormap file in the Barracuda YOLOX demo project without any additional changes.

The code in the ONNX-DirectML tutorial applies normalization to the model input, so you'll need to remove the normalization steps (i.e., the subtraction and division operations) from the PerformInference function in the dllmain.cpp file.

Also, use the TensorFlow.js plugin if you need to run the model in a browser. The model runs with Barracuda in WebGL builds, but inference is slower, and there are weird glitches when changing input dimensions.

aforadil commented 1 year ago

Thanks for the quick response, @cj-mills.

aforadil commented 1 year ago

Hi @cj-mills, I was following your tutorial. The code runs fine in Colab. When I tried running it in a local Python environment with Mamba, the following error appears in the prompt when the "Train the Model" cell is executed. I have tried executing all the cells in sequence, but the error persists. It looks like an issue with multiprocessing. It would be great if you could help with this. Thanks.

Error:

[I 17:51:29.357 NotebookApp] Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Madil2\mambaforge\envs\pytorch-env\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Madil2\mambaforge\envs\pytorch-env\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'HagridDataset' on <module '__main__' (built-in)>

cj-mills commented 1 year ago

Hi @aforadil,

Python Multiprocessing on Windows is slightly weird and even more so with Jupyter notebooks.

The fix for the AttributeError is to place the HagridDataset class in a separate Python file and import it. I just added a version of the training notebook for Windows to the GitHub repository.

I'll add a note to the tutorial to use that one when running on Windows.

Edit: Also download the windows_utils.py file and place it in the same folder as the notebook.
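
As a rough illustration of why moving the class into its own file helps (a minimal sketch, not the contents of windows_utils.py): pickle stores a class by its module and name rather than by value, so a freshly spawned worker process has to be able to import it. The module name example_dataset_utils and the ExampleDataset class below are hypothetical stand-ins.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module name and class body, used only for illustration.
MODULE_NAME = "example_dataset_utils"
MODULE_SOURCE = '''
class ExampleDataset:
    """Stand-in for a dataset class such as HagridDataset."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]
'''

# Write the class to its own .py file, as you would next to the notebook.
module_dir = Path(tempfile.mkdtemp())
(module_dir / f"{MODULE_NAME}.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(module_dir))

# Import the class instead of defining it in a notebook cell.
ExampleDataset = importlib.import_module(MODULE_NAME).ExampleDataset

dataset = ExampleDataset(["a", "b", "c"])
payload = pickle.dumps(dataset)

# The pickle holds a reference (module name + class name), not the class body,
# so whoever unpickles it must be able to import that module by name.
stores_module_reference = MODULE_NAME.encode() in payload
restored = pickle.loads(payload)
```

A class defined in a notebook cell belongs to the notebook's main module, which a spawned worker cannot import, and that matches the AttributeError in the traceback above.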

aforadil commented 1 year ago

Great. Thanks.

Xantango commented 11 months ago

Hi, can't I just use the standard COCO annotation JSON file to train my custom model? My data has images with multiple classes, and I noticed that yours points each JSON file to a class folder. Thanks in advance :)

Xantango commented 11 months ago

By the way, following up on my previous question, I already have a custom model trained with YOLOv8. However, when I tried to use it with Unity (after exporting it to ONNX), it gave me a Reshape issue. When I used your model, it worked perfectly. That's why I want to train on my custom data with YOLOX. I mention this because you might have an answer for my main issue :)

cj-mills commented 11 months ago

Hi @Xantango,

For your first question, you can use whichever annotation format you wish, but you'll need to update the relevant code in the notebook for that specific format. The method to read the annotation data used in the tutorial is simply what the HaGRID dataset requires.

I have a notebook in this tutorial's GitHub Repository for working with the COCO dataset. I need to clean it up slightly, but the code is functional.

I am debating whether the COCO notebook warrants a dedicated post to cover the changes in how to read the annotation data.
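
In the meantime, here is a minimal sketch of the idea, assuming the standard COCO JSON layout (images, annotations, and categories lists, with boxes stored as [x, y, width, height]). It is not the code from the COCO notebook; the load_coco_annotations name and the sample data are made up for illustration.

```python
import json
from collections import defaultdict

# Tiny made-up COCO-style annotation file: one image with two classes.
SAMPLE_COCO_JSON = """
{
  "images": [{"id": 1, "file_name": "img_001.jpg", "width": 384, "height": 512}],
  "categories": [{"id": 1, "name": "call"}, {"id": 2, "name": "peace"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [50, 60, 20, 10]}
  ]
}
"""


def load_coco_annotations(coco):
    """Group COCO annotations by image file name.

    Returns {file_name: {"boxes": [[x1, y1, x2, y2], ...], "labels": [...]}}.
    """
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}

    per_image = defaultdict(lambda: {"boxes": [], "labels": []})
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO stores boxes as [x, y, width, height]
        entry = per_image[id_to_file[ann["image_id"]]]
        entry["boxes"].append([x, y, x + w, y + h])  # convert to corner format
        entry["labels"].append(id_to_name[ann["category_id"]])
    return dict(per_image)


targets = load_coco_annotations(json.loads(SAMPLE_COCO_JSON))
```

Because every annotation carries its own category_id, a single file can describe images that contain several classes. A dataset class would then look up the boxes and labels by file name.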

As for your second question, I assume you are trying to use the model with Unity's Barracuda inference library. Barracuda has limited ONNX operator support, and part of the exported YOLOv8 model likely has unsupported operators. That is why I have different export settings for Barracuda in the follow-up post.

Sentis, the replacement for Barracuda, should have better operator support. Unfortunately, Sentis is only available through a closed beta program, and I only got access this week. I plan to port my unity-barracuda-inference-yolox package to Sentis to compare against Barracuda, but I have not had a chance to do so yet.

Xantango commented 11 months ago

Hi @cj-mills, Very informative. Thank you sooo much.

Xantango commented 11 months ago

Hi @cj-mills, this tutorial's code and pytorch-yolox-object-detector-training-coco.ipynb run perfectly fine on Colab. However, when I ran them in Jupyter, including pytorch-yolox-object-detector-training-windows.ipynb, the training section keeps running forever, with neither errors nor progress. Any idea?

Xantango commented 11 months ago

Uhhh, it was the attribute issue

cj-mills commented 11 months ago

@Xantango Were you able to resolve the issue you encountered? I don't have a version of the COCO notebook prepared for Windows, so you would need to make the same changes as in the notebook for the HaGRID dataset and place the windows_utils.py file in the same folder.

Xantango commented 11 months ago

Hi @cj-mills, yes, I combined parts from this tutorial, the COCO version, and the Windows version to suit what I needed, and it is working perfectly fine. Thank you so much for the great, well-explained tutorial.

Zafoue commented 8 months ago

Hello, after running the program in Jupyter, I get this error when running the following cell:

dataset_sample = train_dataset[0]

annotated_tensor = draw_bboxes(
    image=(denorm_img_tensor(dataset_sample[0], *norm_stats)*255).to(dtype=torch.uint8),
    boxes=dataset_sample[1]['boxes'],
    labels=[class_names[int(i.item())] for i in dataset_sample[1]['labels']],
    colors=[int_colors[int(i.item())] for i in dataset_sample[1]['labels']]
)

tensor_to_pil(annotated_tensor)

NameError                                 Traceback (most recent call last)
Cell In[32], line 1
----> 1 dataset_sample = train_dataset[0]
      3 annotated_tensor = draw_bboxes(
      4     image=(denorm_img_tensor(dataset_sample[0], *norm_stats)*255).to(dtype=torch.uint8),
      5     boxes=dataset_sample[1]['boxes'],
      6     labels=[class_names[int(i.item())] for i in dataset_sample[1]['labels']],
      7     colors=[int_colors[int(i.item())] for i in dataset_sample[1]['labels']]
      8 )
     10 tensor_to_pil(annotated_tensor)

File ~\YOLOX\windows_utils.py:61, in HagridDataset.__getitem__(self, index)
     59 annotation = self._annotation_df.loc[img_key]
     60 # Load the image and its target (bounding boxes and labels)
---> 61 image, target = self._load_image_and_target(annotation)
     63 # Apply the transformations, if any
     64 if self._transforms:

File ~\YOLOX\windows_utils.py:89, in HagridDataset._load_image_and_target(self, annotation)
     87 bbox_tensor = torchvision.ops.box_convert(torch.Tensor(bbox_list), 'xywh', 'xyxy')
     88 # Create a BoundingBox object with the bounding boxes
---> 89 boxes = BoundingBox(bbox_tensor, format='xyxy', canvas_size=image.size[::-1])
     90 # Convert the class labels to indices
     91 labels = torch.Tensor([self._class_to_idx[label] for label in annotation.labels])

NameError: name 'BoundingBox' is not defined

cj-mills commented 8 months ago

Hi @Zafoue,

Sorry about that. I must have forgotten to push the updated windows_utils.py file for torchvision 0.16+ to GitHub. I have done so now, and you can download it from the link below:
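
For context on the NameError: torchvision 0.16 renamed the BoundingBox class to BoundingBoxes and moved it from torchvision.datapoints to torchvision.tv_tensors, so code written for one version fails on the other. A version-tolerant import might look like the sketch below. It is an illustration, not the code in the updated windows_utils.py, and the BoundingBoxesCls, BBOX_SIZE_KWARG, and make_boxes names are made up here.

```python
# Sketch: pick whichever bounding-box class the installed torchvision provides.
try:
    # torchvision 0.16 and newer
    from torchvision.tv_tensors import BoundingBoxes as BoundingBoxesCls
    BBOX_SIZE_KWARG = "canvas_size"
except ImportError:
    try:
        # torchvision 0.15 used a different module, class name, and keyword
        from torchvision.datapoints import BoundingBox as BoundingBoxesCls
        BBOX_SIZE_KWARG = "spatial_size"
    except ImportError:
        # torchvision is missing or too old to provide either class
        BoundingBoxesCls = None
        BBOX_SIZE_KWARG = None


def make_boxes(bbox_tensor, image_size):
    """Wrap an (N, 4) xyxy tensor in the available bounding-box class.

    image_size is (height, width), e.g. image.size[::-1] for a PIL image.
    """
    if BoundingBoxesCls is None:
        raise ImportError("No supported torchvision bounding-box class found")
    return BoundingBoxesCls(bbox_tensor, format="xyxy", **{BBOX_SIZE_KWARG: image_size})
```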

Zafoue commented 8 months ago

Hi @cj-mills

I managed to get it to work in Jupyter.

Do you have a tutorial for creating a custom dataset?

cj-mills commented 8 months ago

Hi @Zafoue,

I have not made a tutorial for creating custom datasets yet. I've received several questions about this, so I'll try to make time for one.

If you need something now, there are free annotation tools like CVAT and automated annotation methods, as shown in the following videos:

Zafoue commented 8 months ago

Hi @cj-mills ,

I use RoboFlow for labeling images, but my question is, in which format do I need to export? I looked at your dataset "hagrid-sample-30k-384p," and I noticed there's one labeling file for a set of images. I've tried the yolov8 format and some others, but I always end up with one labeling file per image. As a result, I have multiple labeling files, which doesn't match the format you use in the scripts.

Could you advise me on the correct export format or guide me on how to get a single labeling file for a set of images?

Thank you very much!

ryan-michaud commented 7 months ago

Christian, what a beast of a tutorial! I would like to reproduce this on my own data. How were you able to become so proficient at this? Would you be able to recommend some learning resources for a Python noob?

My background is electrical engineering. Decent proficiency in Matlab, not much Python.

I am thinking :

  1. Leveling up Python skills with some general online course, then
  2. Doing a course such as Fast.ai, or Deep Learning on Coursera?

There is just so much learning material out there, I'm not sure where to start and how to be efficient about it. My immediate interest is in fine tuning object detection models on my own data (as you have done here). If you have any suggestions, it would be greatly appreciated.

cj-mills commented 7 months ago

Hi @ryan-michaud,

The fast.ai courses are my go-to recommendation for getting started with deep learning. If you have some existing coding experience, you can probably make it through without already being proficient in Python.

If you want to build a foundation with Python first, I don't have a default tutorial to recommend (there are so many these days). However, the latest CS50x course is probably a safe bet:

As for the fast.ai courses, part 2 is optional but highly recommended. It is valuable even just from a Python development standpoint.

I recommend having a project to work on while going through the fast.ai course. You seem to have an object detection project in mind, so that's good.

The latest iteration of the courses doesn't go in-depth on object detection specifically. However, an earlier version does:

As for your project, I made some tutorials recently for working with various bounding box annotations in PyTorch, which you might find applicable:

If you would like to explore the YOLOX model specifically, you can view the code for my pip package in the following GitHub repository:

Above all, don't feel you need to wait until you think you have enough theoretical knowledge to start working on your project. Learn what you need as you need to.

Afterward, if you want to test your understanding, make a tutorial. Teaching others how to do something does wonders for identifying any relevant gaps in your knowledge. 😉

ryan-michaud commented 7 months ago

@cj-mills Thank you for such a detailed and thoughtful reply. I hope to give back one day as you have ;-)

cj-mills commented 7 months ago

@ryan-michaud You're very welcome! Best of luck with your project!

Boo2z commented 5 months ago

Hi @cj-mills, there is a version of this project for Windows on GitHub called pytorch-yolox-object-detector-training-windows.ipynb. How can I modify it to run on Linux? Do I need to change just this line, "from windows_utils import HagridDataset", or something else? Thank you)

cj-mills commented 5 months ago

Hi @Boo2z,

There is already a Linux version of the notebook available for this tutorial. You can find links for the available notebooks in the Tutorial Code dropdown under the Getting Started with the Code section:

OlegKochetkov-git commented 4 months ago

Hi @cj-mills. I plan to train the model using this notebook and then use my trained model in this Sentis demo, except with Unity rolled back to version 2022 and Sentis left at version 1.3. After I moved to Unity 2022, your project worked well. Should my trained model work without problems, or could there be difficulties in theory? Thank you in advance.

cj-mills commented 4 months ago

Hi @OlegKochetkov-git, I believe Sentis 1.3 (and the new 1.4) only officially supports Unity 2023.2 or newer:

hericah commented 4 days ago

Thanks for your great tutorial. May I know what needs to change if we want to use datasets from Roboflow? Pascal VOC?

cj-mills commented 3 days ago

Hi @hericah,

If your dataset is currently in roboflow, you can export the annotations to COCO JSON format. You can also use roboflow to convert existing annotations from Pascal VOC to COCO. From there, you can check out how to work with COCO bounding box annotations in the tutorial linked below:

hericah commented 2 days ago

I got this error when running it on an M1 Mac:

{
"name": "PicklingError",
"message": "Can't pickle <function <lambda> at 0x16da18c10>: attribute lookup <lambda> on __main__ failed",
"stack": "---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
Cell In[39], line 1
----> 1 train_loop(
      2     model=model,
      3     train_dataloader=train_dataloader,
      4     valid_dataloader=valid_dataloader,
      5     optimizer=optimizer,
      6     loss_func=yolox_loss,
      7     lr_scheduler=lr_scheduler,
      8     device=torch.device(device),
      9     epochs=epochs,
     10     checkpoint_path=checkpoint_path,
     11     use_scaler=True)

Cell In[34], line 36, in train_loop(model, train_dataloader, valid_dataloader, optimizer, loss_func, lr_scheduler, device, epochs, checkpoint_path, use_scaler)
     33 # Loop over the epochs
     34 for epoch in tqdm(range(epochs), desc=\"Epochs\"):
     35     # Run a training epoch and get the training loss
---> 36     train_loss = run_epoch(model, train_dataloader, optimizer, lr_scheduler, loss_func, device, scaler, epoch, is_training=True)
     37     # Run an evaluation epoch and get the validation loss
     38     with torch.no_grad():

Cell In[33], line 24, in run_epoch(model, dataloader, optimizer, lr_scheduler, loss_func, device, scaler, epoch_id, is_training)
     21 progress_bar = tqdm(total=len(dataloader), desc=\"Train\" if is_training else \"Eval\")  # Initialize a progress bar
     23 # Loop over the data
---> 24 for batch_id, (inputs, targets) in enumerate(dataloader):
     25     # Move inputs and targets to the specified device
     26     inputs = torch.stack(inputs).to(device)
     27     # Extract the ground truth bounding boxes and labels

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/site-packages/torch/utils/data/dataloader.py:440, in DataLoader.__iter__(self)
    438     return self._iterator
    439 else:
--> 440     return self._get_iterator()

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/site-packages/torch/utils/data/dataloader.py:388, in DataLoader._get_iterator(self)
    386 else:
    387     self.check_worker_number_rationality()
--> 388     return _MultiProcessingDataLoaderIter(self)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/site-packages/torch/utils/data/dataloader.py:1038, in _MultiProcessingDataLoaderIter.__init__(self, loader)
   1031 w.daemon = True
   1032 # NB: Process.start() actually take some time as it needs to
   1033 #     start a process and pass the arguments over via a pipe.
   1034 #     Therefore, we only add a worker to self._workers list after
   1035 #     it started, so that we do not call .join() if program dies
   1036 #     before it starts, and __del__ tries to join but will get:
   1037 #     AssertionError: can only join a started process.
-> 1038 w.start()
   1039 self._index_queues.append(index_queue)
   1040 self._workers.append(w)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/process.py:121, in BaseProcess.start(self)
    118 assert not _current_process._config.get('daemon'), \
    119        'daemonic processes are not allowed to have children'
    120 _cleanup()
--> 121 self._popen = self._Popen(self)
    122 self._sentinel = self._popen.sentinel
    123 # Avoid a refcycle if the target function holds an indirect
    124 # reference to the process object (see bpo-30775)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/context.py:224, in Process._Popen(process_obj)
    222 @staticmethod
    223 def _Popen(process_obj):
--> 224     return _default_context.get_context().Process._Popen(process_obj)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/context.py:288, in SpawnProcess._Popen(process_obj)
    285 @staticmethod
    286 def _Popen(process_obj):
    287     from .popen_spawn_posix import Popen
--> 288     return Popen(process_obj)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/popen_spawn_posix.py:32, in Popen.__init__(self, process_obj)
     30 def __init__(self, process_obj):
     31     self._fds = []
---> 32     super().__init__(process_obj)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/popen_fork.py:19, in Popen.__init__(self, process_obj)
     17 self.returncode = None
     18 self.finalizer = None
---> 19 self._launch(process_obj)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/popen_spawn_posix.py:47, in Popen._launch(self, process_obj)
     45 try:
     46     reduction.dump(prep_data, fp)
---> 47     reduction.dump(process_obj, fp)
     48 finally:
     49     set_spawning_popen(None)

File /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-env/lib/python3.10/multiprocessing/reduction.py:60, in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)

PicklingError: Can't pickle <function <lambda> at 0x16da18c10>: attribute lookup <lambda> on __main__ failed"
}

cj-mills commented 2 days ago

Hi @hericah,

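The issue stems from the difference in how Python multiprocessing works on Linux and macOS. macOS starts worker processes with the spawn method, which has to pickle everything it sends to a worker, whereas Linux has traditionally defaulted to fork.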

Alternatively, you can set the value for num_workers in data_loader_params to 0 to disable multiprocessing.
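
Since the traceback ends in ForkingPickler, everything handed to a spawned worker has to be picklable, and lambdas are not. If the DataLoader arguments include a lambda (for example as collate_fn, which is an assumption about the notebook rather than something confirmed in this thread), a named module-level function is a drop-in replacement. The sketch below uses only the standard library. The collate_as_tuple and can_pickle names and the batch_size value are placeholders.

```python
import pickle


def collate_as_tuple(batch):
    """Named replacement for `lambda batch: tuple(zip(*batch))`."""
    return tuple(zip(*batch))


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, which spawn-based workers rely on."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A lambda has no importable name, so pickling it fails.
lambda_picklable = can_pickle(lambda batch: tuple(zip(*batch)))

# Hypothetical DataLoader keyword arguments. num_workers=0 keeps data loading
# in the main process, so nothing needs to be pickled at all.
data_loader_params = {
    "batch_size": 4,                 # placeholder value
    "num_workers": 0,                # disables multiprocessing
    "collate_fn": collate_as_tuple,  # picklable when importable by name
}
```

Note that with spawn, the named function also needs to live somewhere the worker can import it, which in a notebook means a separate .py file, as with the dataset class earlier in this thread.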