cj-mills / christianjmills

My personal blog
https://christianjmills.com/

posts/pytorch-train-keypoint-rcnn-tutorial/ #46

Open utterances-bot opened 10 months ago

utterances-bot commented 10 months ago

Christian Mills - Training Keypoint R-CNN Models with PyTorch

Learn how to train Keypoint R-CNN models on custom datasets with PyTorch.

https://christianjmills.com/posts/pytorch-train-keypoint-rcnn-tutorial/

troymyname commented 10 months ago

Thank you for the amazing tutorials Chris!

zickezacke commented 9 months ago

Great tutorial, Chris! I tried to duplicate your approach, but it triggers the assert:

Loss is NaN or infinite at epoch 0, batch 0. Stopping training.

Thanks

cj-mills commented 9 months ago

Hi @zickezacke,

I verified the training code in the tutorial runs successfully on a CPU and a CUDA GPU this morning.

I forgot to include a version of the Jupyter Notebook for running on Windows, so I added that. Python multiprocessing works differently on Windows versus Linux, so the training code needs slight tweaks. Although, I don't believe that is the source of your issue, as that results in a different error.

I don't have a Mac, so I can't verify how the code runs there if that is what you are using.

Were you trying to implement the code manually? If so, try downloading and running the pre-completed training notebook to see if that runs successfully.

zickezacke commented 9 months ago

Hi @cj-mills ,

Thank you for your response. I appreciate it. I tried running your script with a copy of the notebook with the same result. The notebook is running in a WSL2 with CUDA. I do not know why the "nan" for the loss_item is occurring but going to research if I can figure it out.

Wish you a great weekend!
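
One way to narrow down a non-finite loss like the one above is to check each loss component separately instead of only the summed value. The sketch below assumes the model returns a dictionary of named losses in training mode, as torchvision's detection models do. The helper name `find_non_finite_losses` and the sample values are hypothetical and are not part of the tutorial code.

```python
import math


def find_non_finite_losses(loss_dict):
    """Return the names of loss components whose value is NaN or infinite."""
    bad = []
    for name, value in loss_dict.items():
        # float() also accepts single-element tensors, not just Python numbers
        if not math.isfinite(float(value)):
            bad.append(name)
    return bad


# Hypothetical values, for illustration only
example_losses = {
    "loss_classifier": 0.42,
    "loss_box_reg": 0.13,
    "loss_keypoint": float("nan"),
}
print(find_non_finite_losses(example_losses))
```

Knowing which component goes non-finite first (for example, the keypoint loss versus the box losses) can point toward either the annotations or the optimizer settings as the place to look.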

troymyname commented 8 months ago

Hi Chris, have you considered writing a tutorial to deploy the models? For instance, have you considered converting the model from ONNX to a format that can be used in a TensorRT environment? Thanks again for your efforts!

cj-mills commented 8 months ago

@troymyname Like this one?

If so, I have been considering it for the other model tutorials. It's just a matter of finding the time to do those (and the other tutorials I've had planned for a while).

troymyname commented 8 months ago

@cj-mills That's correct. I am looking into solutions to quantize the model and prepare it for deployment. I have looked into several conversion pathways such as Torch --> TensorRT or Torch --> ONNX --> TensorRT. I have used the Polygraphy package from NVIDIA to prepare the model prior to conversion. However, I am running into issues at the moment. I will keep trying, and also look out for your post on how to do so for the KeyPoint RCNN model. Thanks!

zickezacke commented 8 months ago

I agree with you. Lots of moving pieces. While I was able to achieve good results with the R-CNN, I started looking at the Fast R-CNN V3 because it would reduce the amount of feature implementation required and increase the overall performance.

ErikDerGute commented 7 months ago

Hey man, huge thank you for your amazing tutorial. I used your tutorial as a walkthrough to implement keypointrcnn for my custom application. I want to train the net on a custom dataset. The dataset contains n classes, and each class contains m keypoints. However, sooner or later I always run into the same error at this line: "keypoint_loss = F.cross_entropy(keypoint_logits[valid], keypoint_targets[valid])". Maybe you know where this error comes from. Thanks

cj-mills commented 7 months ago

Hi @ErikDerGute,

Would you mind adding the complete error statement?

The tutorial code does not currently support multiple object classes, so you would need to make some modifications.

First, the sample dataset used in the tutorial has the same number of object classes (one+background) as the dataset used to pre-train the model, so it only updates the keypoint predictor. You would also need to update the model's bounding box predictor to use it with a dataset containing multiple object classes.

Something like this:

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # Include background as a class, e.g., for 2 actual classes, this would be 3
in_features_box = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features_box, num_classes)

The current training loop assumes there is a single object class (excluding the background class), and hardcodes the values for gt_labels to 1 for the single object class. You would need to update the LabelMeKeypointDataset class to get the index values for the object classes and pass that to the training loop.
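
As a rough sketch of the label-index step described above, assuming each annotation carries its object class as a string: the helper names and the class names below are hypothetical, and the actual LabelMeKeypointDataset changes may look different. Index 0 is reserved for the background class, matching the num_classes convention above.

```python
def build_class_to_idx(class_names):
    """Map each object class name to an integer label, reserving 0 for background."""
    unique_names = sorted(set(class_names))
    return {name: idx for idx, name in enumerate(unique_names, start=1)}


def labels_for_annotations(annotation_classes, class_to_idx):
    """Look up the integer label for every annotated object in one image."""
    return [class_to_idx[name] for name in annotation_classes]


# Hypothetical class names, for illustration only
class_to_idx = build_class_to_idx(["cat", "dog"])
gt_labels = labels_for_annotations(["dog", "cat", "dog"], class_to_idx)
print(class_to_idx, gt_labels)
```

In the dataset class, the resulting list would typically be converted to an int64 tensor before being placed in the target dictionary, in place of the hardcoded value of 1.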

It sounds like the different object classes all have the same number of keypoints. If not, you would need to make some additional changes.

When modifying the keypoint predictor, you would use the keypoint count from the object class with the highest number of keypoints.

Next, you would need to update the training code to address the fact that the object classes in your dataset contain varying numbers of keypoints. The loss function expects the keypoint count specified for the keypoint predictor, which will be higher than the count for some of the classes in your dataset.

You could probably address this using the visibility_mask, where the extra keypoints get marked as not visible for object classes with lower numbers of keypoints. Although, I have not tested this approach.
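
A minimal sketch of that padding idea, with the same caveat that it is untested against the tutorial's training loop: it assumes keypoints are stored as [x, y, visibility] triplets, where a visibility of 0 means not visible, and it pads each object up to the largest per-class keypoint count. The function name, class names, and coordinates are hypothetical.

```python
def pad_keypoints(keypoints, max_keypoints):
    """Pad one object's [x, y, visibility] keypoints up to max_keypoints."""
    if len(keypoints) > max_keypoints:
        raise ValueError("object has more keypoints than the predictor supports")
    # Extra slots get visibility 0 so they can be masked out of the loss
    padding = [[0.0, 0.0, 0.0] for _ in range(max_keypoints - len(keypoints))]
    return [list(kp) for kp in keypoints] + padding


# Hypothetical per-class keypoint counts, for illustration only
keypoints_per_class = {"hand": 3, "face": 5}
max_keypoints = max(keypoints_per_class.values())

hand = [[10.0, 20.0, 1.0], [30.0, 40.0, 1.0], [50.0, 60.0, 1.0]]
padded = pad_keypoints(hand, max_keypoints)
visibility_mask = [kp[2] > 0 for kp in padded]
print(padded, visibility_mask)
```

Here max_keypoints is the value that would be passed to the keypoint predictor, and the padded entries are the ones a visibility mask would exclude.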

ErikDerGute commented 7 months ago

Thanks for your fast and detailed reply. I had already adapted num_classes, num_keypoints, etc. for my application, as you mentioned above. However, I was able to figure out the cause of the error by myself. It was completely my fault, as the model expects the keypoints to be formatted in target{}:[[[kp1_obj1], [kp2_obj1], [....]], [[kp1_obj2], [kp2_obj2], [...]]]. My keypoints were formatted like [[kp1_obj1, kp2_obj1, ...], [kp1_obj2, kp2_obj2, ...]]. Unfortunately, I overlooked this little flaw for two days. Of course, this format error causes big chaos when trying to match the keypoint idxs in the keypointrcnn_loss function. The final result is the described index error.
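
For anyone hitting the same format issue, the conversion can be sketched as below. This assumes each object's keypoints arrive as one flat list of x, y, visibility values, which is only one possible reading of the flat format described above. The function name and sample values are hypothetical.

```python
def nest_keypoints(flat_objects, values_per_keypoint=3):
    """Convert per-object flat lists into a nested [objects][keypoints][values] layout."""
    nested = []
    for flat in flat_objects:
        if len(flat) % values_per_keypoint != 0:
            raise ValueError("flat keypoint list has an incomplete keypoint")
        nested.append([
            flat[i:i + values_per_keypoint]
            for i in range(0, len(flat), values_per_keypoint)
        ])
    return nested


# Two objects with two keypoints each, for illustration only
flat = [
    [1.0, 2.0, 1.0, 3.0, 4.0, 1.0],
    [5.0, 6.0, 1.0, 7.0, 8.0, 0.0],
]
nested = nest_keypoints(flat)
print(nested)
```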

Yogeshvasu commented 7 months ago

Thank you for the amazing tutorials, Chris! I need to convert this model to .ptl for deployment on Android. Is that possible? I tried, but I am getting an error.

import torch
import torchvision.models as models
from torch.utils.mobile_optimizer import optimize_for_mobile
import torchvision.transforms as transforms

# Load your PyTorch model (modify this based on your model architecture and loading method)
file_id = val_keys[0]

# Retrieve the image file path associated with the file ID
test_file = img_dict[file_id]

# Open the test file
input_img = Image.open(test_file).convert('RGB')

model = models.detection.keypointrcnn_resnet50_fpn(pretrained=False)

# Define the device where you want to run your model
device = torch.device('cpu')  # or 'cuda' if you have GPU

# Ensure the model is in evaluation mode and move it to the specified device
model.eval()
model.to(device)

# Define your input data and preprocessing pipeline
#input_img = test_img # Your input image (modify this based on your input data)
example = torch.rand(1, 3, 224, 224)
traced_script_module = torch.jit.trace(model, example)
traced_script_module_optimized = optimize_for_mobile(traced_script_module)
traced_script_module_optimized._save_for_lite_interpreter("model.ptl")

cj-mills commented 7 months ago

Hi @Yogeshvasu,

While I would recommend using torch.jit.script instead of torch.jit.trace to resolve the first error message you are likely getting, I don't believe the model is supported by _save_for_lite_interpreter, unfortunately.

Yogeshvasu commented 7 months ago

Hi Chris,

Thanks for your suggestion. I have seen the conversion of your model to ONNX. Can this ONNX model be converted to TFLite? I need to check it for deployment on Android.

If possible, could you please confirm? I am facing an error.

from onnx_tf.backend import prepare
import onnx

onnx_model_path = 'model.onnx'
tf_model_path = 'model_tf'

onnx_model = onnx.load(onnx_model_path)
tf_rep = prepare(onnx_model)
tf_rep.export_graph(tf_model_path)

cj-mills commented 7 months ago

@Yogeshvasu Unfortunately, I don't believe it is supported by that method either. I'll probably end up replacing the Keypoint R-CNN model used in this tutorial with something that has more general compatibility at some point.

For now, if you just need a human keypoint estimation model for TFLite, check out this page: