AlvinYH / Faster-VoxelPose

Official implementation of Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection

How to define the axes in my own scenario #21

Closed gpastal24 closed 1 year ago

gpastal24 commented 1 year ago

Hello, I am trying to use this model in a real-world setting. Could you please explain how the axes should be defined during calibration? (I have them as x, z, y, with z being the height dimension.) It appears that each camera predicts its own pose, and the poses end up in completely different positions. Nonetheless, the projection back to the camera works OK for each respective camera if the prediction comes from that one.

1124792823 commented 10 months ago

Hello,

I've encountered the same issue as you. The individual cameras seem to predict their own poses, resulting in completely different positions. However, I also noticed that the back-projection to each respective camera works fine when the prediction originates from that specific camera. Have you found a solution or workaround for this problem?

gpastal24 commented 10 months ago

Yes. If I remember correctly, I had to use Rw2c (world-to-camera) and Tw2c. These are the matrices OpenCV's solvePnP outputs if you calibrate your cameras yourself. I also rotate the rotation matrix by multiplying with the M matrix in the panoptic.py file, so that height becomes the third dimension (I had it as the second when calibrating). Lastly, make sure you define the space center and space size correctly: I used a 4x4 capture space, for example, and my origin point was close to one of the cameras. If you use a 4x4 space, you should also change the number of voxels for the space accordingly (from 80 to 40).
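For reference, here is a minimal sketch (my own, not code from the repo) of getting world-to-camera extrinsics with OpenCV, assuming the pattern's 3D world points and their detected pixel locations are already known:

import cv2

def world_to_cam_extrinsics(object_points, image_points, K, dist):
    """object_points: (N, 3) pattern points in your world frame,
    image_points: (N, 1, 2) detected pixel coordinates,
    K: (3, 3) intrinsics, dist: distortion coefficients."""
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
    assert ok, 'solvePnP failed'
    R_w2c, _ = cv2.Rodrigues(rvec)   # world-to-camera rotation (3x3)
    T_w2c = tvec.reshape(3)          # world-to-camera translation
    return R_w2c, T_w2c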

gpastal24 commented 10 months ago

If you haven't calibrated your cameras yourself and you have Rc, Tc, you can first find Rw (Rc^-1) and then Tw from the equation Tc = -Rw Tw, if I remember correctly off the top of my head. So Tw would be -Rc Tc (I think).
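For completeness, the standard inversion of a rigid transform looks like the sketch below (generic, independent of which convention your calibration tool uses):

import numpy as np

def invert_extrinsics(R, T):
    """Invert a rigid transform (R, T): if x' = R @ x + T, then x = R.T @ x' - R.T @ T."""
    R_inv = R.T
    T_inv = -R_inv @ T
    return R_inv, T_inv

# e.g. if (Rc, Tc) map camera coordinates to world coordinates, then
# invert_extrinsics(Rc, Tc) yields the world-to-camera extrinsics (Rw2c, Tw2c).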

1124792823 commented 9 months ago

Hello, thank you very much for your previous response. I greatly appreciate the time and effort you put into helping me. I'm reaching out to seek some advice on using Faster-VoxelPose, particularly regarding camera calibration and testing my own image data. As a beginner in this area, I am planning to set up cameras at my own site. I understand that obtaining camera calibration data is a crucial step in this process, but I'm not entirely sure how to go about it. Could you provide some guidance on how to acquire this calibration data? Additionally, I am curious whether it's possible to visualize the results in video form, and whether you have any experience testing your own image data with Faster-VoxelPose. Any insights or suggestions you can offer would be greatly appreciated. ❤

Thank you very much for your time and assistance.

gpastal24 commented 9 months ago

@1124792823 Hello, I have tested the model with a live camera feed. The cameras were calibrated (intrinsic and extrinsic parameters known). I use a pattern like this one: link. An example of how to calibrate the camera is provided in the following link.

Basically, you need to map each point that is detected in the pattern from the camera's point of view to the world coordinate system that you will define (where your (0,0,0) will be). In practice, you only have to measure the 4 corners of the pattern, since the distances between the points are fixed (measure them with a ruler; they depend on your pattern size).

So you create a script that takes the 4 corners in 3D space as input and detects the points on the pattern. Then you map the detected points to world coordinates with a function that you write (I believe, for example, every 4 points you add a length unit on the x-axis); a sketch of this kind of script is given below.
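Something along these lines, as a sketch with made-up grid size and spacing; it assumes a chessboard-style pattern (use whatever detector matches your pattern) and that the ordering of the detected corners matches the generated grid.

import cv2
import numpy as np

def pattern_correspondences(image, grid=(11, 4), spacing=0.25, corner_origin=(1.0, 2.0, 0.0)):
    """Detect the pattern in `image` and build the matching 3D points in the world frame,
    starting from the measured position of the first corner. The pattern is assumed to lie
    flat (constant z) with `spacing` metres between neighbouring points."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    found, image_points = cv2.findChessboardCorners(gray, grid)
    assert found, 'pattern not detected'

    cols, rows = grid
    xs, ys = np.meshgrid(np.arange(cols), np.arange(rows))     # grid indices, row by row
    object_points = np.stack([xs, ys, np.zeros_like(xs)], -1).reshape(-1, 3) * spacing
    object_points = object_points.astype(np.float64) + np.asarray(corner_origin)
    return object_points, image_points   # these are the inputs for cv2.solvePnP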

You can visualize the results, both the projections and the 3D data. I created a function to convert the output of the model to the COCO keypoint format (17, 3) and projected the keypoints onto the corresponding images.
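For the projection step, a sketch (not the commenter's actual function; the joint units and order must match your calibration and the model's output):

import cv2
import numpy as np

def draw_pose(image, joints_3d, R_w2c, T_w2c, K, dist):
    """Project (17, 3) world-space joints onto one camera image and draw them."""
    rvec, _ = cv2.Rodrigues(R_w2c)
    pts, _ = cv2.projectPoints(joints_3d.astype(np.float64), rvec,
                               T_w2c.astype(np.float64), K, dist)
    for x, y in pts.reshape(-1, 2):
        cv2.circle(image, (int(x), int(y)), 3, (0, 255, 0), -1)
    return image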

You will also need to create a new yaml file with your own space configuration; in my case it was 4x4x2. The 4x4 is the square the cameras were capturing: let's say I had 3 cameras at (0.0, 0.0, Z1), (4.0, 0.0, Z2), (0.0, 4.0, Z3). I had to change the voxels per axis as well, from 80x80x20 to 40x40x20 (so that the space is discretized the same way as the one the model was trained on). If you can, try it in a bigger space like 8x8x2, as it was trained on; I believe you'll get better results, although from my experience they are pretty OK, especially for 1 person. I guess using more cameras in this small space would also help with more people, but I still think the fact that the cubes to be analyzed are smaller than in the space config it was trained on plays a role (from 80x80 HW down to 40x40).
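For the space settings, the yaml ends up looking roughly like the sketch below. The key names and units here are from memory and may differ between versions, so mirror an existing file under configs/ rather than copying this verbatim.

CAPTURE_SPEC:
  SPACE_SIZE: [4000.0, 4000.0, 2000.0]     # the 4 x 4 x 2 m area, in millimetres
  SPACE_CENTER: [2000.0, 2000.0, 1000.0]   # centre of that area in your world frame
  VOXELS_PER_AXIS: [40, 40, 20]            # halved from 80x80x20 to keep the voxel size similar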

1124792823 commented 9 months ago
Thank you so much for your detailed and informative response. I truly appreciate the time and effort you took to explain the process, especially the practical aspects of camera calibration using a pattern and mapping points from the camera's point of view to the world coordinate system.

Your guidance on creating a script for the 3D space mapping and the advice on adjusting the voxel space configuration for different camera setups are particularly helpful. Also, your insights into the results of the model, especially in relation to the number of cameras used and the size of the space, are invaluable. I am encouraged by your positive experience with the model, especially in a one-person scenario, and I'm looking forward to applying these learnings to my project. Once again, thank you for your assistance and for sharing your experience. It has been incredibly helpful.


turtlebot1 commented 5 months ago

@gpastal24 Hello, thank you for the awesome explanation of the transformation. One question I have is how/where we can define the world coordinate frame so that we can calculate the rotation matrix of each camera. Secondly, I saw that you tested this system with a live camera feed; may I know if you trained the network for that particular scenario, or did you use the pretrained Panoptic network? If you are using the pretrained network, may I know how accurate the predictions are and how many frames per second it runs at? Please let me know, as it would be really helpful for my research.

@1124792823 May I know if you were able to get this model running for your own video feed?

gpastal24 commented 5 months ago

@turtlebot1 You can set any point in space as your origin and calibrate the cameras with respect to it. Then you have to define the space size and center in the config file. For testing I used a model trained on Panoptic, since it is more convenient and faster than running Mask R-CNN (or any other detector) plus top-down 2D pose estimation, as in Shelf and Campus, on many cameras.

The 2D backbone used for Panoptic does not require any post-processing either. I had trained the network from scratch with the old version of the code, since the weights were not available then.

The processing (2D heatmaps + 3D joints) ran at around 40 FPS for 3 cameras on an RTX 3080 (I converted the backbone and the CNN parts of the 3D network to TensorRT). If you don't convert the network, you will get close to 20 FPS for 3 cameras. The pretrained network will naturally lose some accuracy in other scenes, and using 3 cameras instead of 5 will further degrade its performance. You may see more false positives with many people in the scene, but overall the projections back to the cameras are accurate for the most part.
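For what it's worth, one possible route to that kind of TensorRT speed-up is the torch2trt converter; the sketch below is generic (model.backbone is a placeholder attribute name, not necessarily how the repo structures its modules).

import torch
from torch2trt import torch2trt  # https://github.com/NVIDIA-AI-IOT/torch2trt

backbone = model.backbone.cuda().eval()        # the 2D heatmap backbone (placeholder name)
example = torch.randn(1, 3, 512, 960).cuda()   # dummy input matching your image resolution
backbone_trt = torch2trt(backbone, [example], fp16_mode=True)
torch.save(backbone_trt.state_dict(), 'backbone_trt.pth')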

turtlebot1 commented 5 months ago

@gpastal24, Thank you very much for your reply and for the information that I can calibrate the cameras with any point as my origin. Currently, I am testing images the way the authors do in the visualize.ipynb file: I take images from the different cameras at a particular time instant and use them as input to the model, and the processing is only around 12 FPS on an RTX 4080. Do you think I can improve this by converting the network? Another question I have is: how do I measure the space size and space center of my observation space? While using a single camera and playing around with the parameters, the 2D pose estimation on the images behaves perfectly, but the 3D projections of those points come out in weird shapes. May I know how I can find a way out of this? Any input you could give would be really helpful, and I should say your comments in this thread have been really informative. Thank you in advance.

gpastal24 commented 5 months ago

@turtlebot1 I can't help you with the use of 1 camera, to be honest; I have never tried that. Plus, I don't know how you should define the space size when you use only 1 camera. You could try training the model from scratch on Panoptic, but I don't know if this would work. The authors of VoxelPose (not Faster VoxelPose) did report results with 1 camera, though, so I guess it might work. I would advise you to use at least 3 cameras.

Regarding the reprojection results, they are either caused by incorrect calibration, or the model is not able to produce accurate 3D estimates, since it was not trained for the monocular case.

Yes, if you convert the model to TensorRT you should expect roughly double the throughput.

turtlebot1 commented 5 months ago

@gpastal24, Thank you again for your detailed reply; I truly appreciate your time and effort. Considering that the model has been trained by the authors using multi-camera views, I have decided to go with a stereo vision setup for pose estimation. Using the link you previously shared, I am able to get the intrinsic parameters of both cameras, but not the camera extrinsics (rotation matrix and translation vector) for each camera. May I know what world coordinate frame is assumed in this project, so that I can calculate the extrinsics of each camera with respect to the world frame? In the case of stereo calibration, could you help me with how to get the space center and space size properly? Apologies for the consecutive replies; I am new to this area, which is why I have so many questions, and I have found your responses really helpful.

gpastal24 commented 5 months ago

@turtlebot1

When you say stereo vision setup, do you mean you have two lenses in a single device? What I mean by using 3 or more cameras is to have them far apart from each other, for example in a triangle formation. After calibrating the cameras you will have the rotation matrices and the translation vectors. For simplicity, let's say they are at locations (0,0,Z1), (X,0,Z2), (0,Y,Z3). The capture space in this case would be X, Y, 2 (you won't have to change the height parameter unless you do something extreme), and the space center would be (X/2, Y/2, 1).

You can use the right-hand rule to define your system. Pick a point on a wall or something and measure the distances of the 4 corners of the pattern from this point. Then, since you know the distances between the points of the pattern, you can obtain the extrinsics as well.

turtlebot1 commented 5 months ago

@gpastal24, Thank you for your time and support; I really appreciate it. By stereo vision I meant using two cameras that are a known distance apart and focused on the same object/person. I understood how to define the capture space and space center from your explanation. The major issue I am facing is with the rotation matrix and translation vectors. I am able to get the camera intrinsics (fx, fy, cx, cy, p and k) from the link you shared above, and I am sorry to ask this again, but I am not completely sure how to calibrate the extrinsics, that is, how to obtain a set of 3D points in the world coordinate system and their corresponding 2D projections in the image. Could you share some information or literature on how to get the camera extrinsics for the triangular camera arrangement with respect to the world coordinate frame (I am also not sure what the world coordinate frame is in this case, as it is not defined in the literature)? Anything you could share would be really helpful. Thank you once again for all your help!

gpastal24 commented 5 months ago

@turtlebot1 You can calibrate each camera on its own, so you won't mess things up. The world coordinate frame is the origin point (0,0,0) that you define wherever you want it to be. Then you estimate the rotation and translation from the 2D-to-3D correspondences: the 2D points would be the 44 points of the pattern (could be more, could be less, depending on the pattern you are using) in the camera frame, and the 3D points the positions of the same points in the world coordinate system, which is defined by your origin point.

turtlebot1 commented 5 months ago

@gpastal24 Thank you once again for your guidance, and apologies for disturbing you again. I followed the experimental procedure as described by you and successfully obtained output for custom camera frames. However, I've encountered an issue where the coordinate frames seem to be misaligned: what is described as the x-y plane in your work appears as the x-z plane in my setup, and so on. Could you possibly shed some light on why this discrepancy might be occurring? Additionally, would it be possible to share the code where you made the changes for taking in custom video frames and processing them? I am quite lost at the moment, not knowing where I could possibly be going wrong. Any information you can give would be really helpful. Thank you once again.

gpastal24 commented 5 months ago

@turtlebot1 You might have to rotate the axes as in the following lines; your M matrix can be different from the one used there.

https://github.com/AlvinYH/Faster-VoxelPose/blob/733aa9856f277498e4be40cd97105138c90b2ca8/lib/dataset/panoptic.py#L171-L205

When I have some free time, I will try to attach some code snippets.
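In the meantime, the general idea is sketched below; the exact M in panoptic.py may differ, so pick the permutation/rotation (and signs) that match how your own axes were defined.

import numpy as np

# A 90-degree rotation about the x-axis that exchanges the y and z axes; which signs
# you need depends on how your calibration frame and the model's frame are oriented.
M = np.array([[1.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

# For world-to-camera extrinsics (x_cam = R @ x_world + T), changing the world axes
# only affects R; the translation stays the same as long as the origin is unchanged.
def rotate_world_axes(R_w2c):
    return R_w2c @ M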

turtlebot1 commented 5 months ago

@gpastal24 Thank you for your assistance and patience. Using the information you have provided, I was able to get the model working.

gpastal24 commented 5 months ago

@turtlebot1 That's good to hear. How was the performance?

turtlebot1 commented 4 months ago

@gpastal24 I am still working on improving the performance. I have not seen a lot of misclassifications, but the model has high latency (with 2 cameras it processes at about 5 FPS). Previously, I was not following the approach the authors take in the validate.py file, i.e.

test_dataset = eval('dataset.' + config.DATASET.TEST_DATASET)(
    config, False,
    transforms.Compose([
        transforms.ToTensor(),
        normalize,
    ]))

test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=config.TEST.BATCH_SIZE,
    shuffle=False,
    num_workers=config.WORKERS,
    pin_memory=True)

but rather I am doing it the way the authors do in the visualize.ipynb file. That is, I am grabbing images from each camera (for synchronized timeframes), converting them to tensors, and feeding them to the model. I believe this is the reason for the higher latency, and I am now switching to the former type of data processing. May I know if your processing time is faster when done the former way? Please let me know. Thank you.
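For reference, the per-frame path looks roughly like the sketch below (my own assumptions about input size and normalization; the repo's actual preprocessing may differ, e.g. it may use an affine warp rather than a plain resize).

import cv2
import torch
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
to_tensor = transforms.Compose([transforms.ToTensor(), normalize])

def frames_to_input(frames, size=(960, 512), device='cuda'):
    """frames: one BGR image per camera, captured at the same time instant."""
    views = []
    for img in frames:
        img = cv2.resize(img, size)                  # size is (width, height)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # OpenCV loads images as BGR
        views.append(to_tensor(img))
    # shape (1, num_views, 3, H, W): a batch of one multi-view sample
    return torch.stack(views).unsqueeze(0).to(device, non_blocking=True)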

turtlebot1 commented 4 months ago

@gpastal24 I had one more question. I am able to plot the estimated keypoints in 3D and 2D perfectly, but the pose overlay on the image is not perfect (that is, the pose estimates do not correctly align with the person in the image). May I know what could be the reason for this issue? Kindly let me know whenever possible. Thank you in advance.

gpastal24 commented 4 months ago

@turtlebot1 Hello, I didn't quite understand your issue. Are you referring to some false positives? The projections of false positives will probably not align with any person.

turtlebot1 commented 4 months ago

@gpastal24, The image I have attached below can explain the issue I am facing. I am getting the pose estimates properly in the coordinate planes, but in the case of 'images with poses' (i.e. superimposing the pose estimates over the images), they are not aligning with the actual humans and are offset a little. May I know what could be causing this? Thank you for your time and patience, I really appreciate it!

Screenshot from 2024-04-24 12-01-06

turtlebot1 commented 4 months ago

@gpastal24, Another question I had is the following. I have 3 cameras: Camera_1 is at (0,0,Z); Camera_2 is at (4,0,Z) and at an angle of -90 degrees with respect to Camera_1; Camera_3 is at (0,4,Z) and at an angle of 45 degrees with respect to Camera_1. I calibrate all the cameras individually using the OpenCV toolbox and a checkerboard pattern. The code I used for calibration can be found in this link. One question I have about this method (please correct me if I am wrong): if we are using one camera as a reference, we should rotate and translate the other cameras with respect to the reference camera as well. In that case, how do we rotate the other cameras with respect to the reference camera? Please let me know whenever possible. Thank you again for your kind consideration.

gpastal24 commented 4 months ago

I can't answer with certainty, but it appears to me that something is wrong with the calibration parameters in the image. It is weird, since these are frames from Shelf. Are you using the camera params from the Shelf dataset, or what?

Did you calibrate them with a common reference point or not? If you calibrated them with a common reference point, you don't have to do anything more, apart from maybe rotating them with the M matrix as discussed previously. If you calibrated each camera with its own reference point (its optical center, I suppose), it is a bit trickier. If you are certain about the rotations and positions, you can estimate their R and T relative to the reference camera with a math formula that I can't seem to find right now (I will look for it a bit more). I suppose, though, that you can directly calibrate them with reference to the 1st camera by measuring the distance of the pattern's corners from the camera sensor.

Of course this is a bit more challenging than selecting an arbitrary point (how can you be sure you are measuring from the optical center?).
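If it helps, the relation alluded to above is, I believe, the standard one for chaining world-to-camera extrinsics (stated here as an assumption, since it is not given in the thread): with x_c1 = R1 x_w + T1 and x_c2 = R2 x_w + T2, the pose of camera 2 relative to camera 1 is R12 = R2 R1^T and T12 = T2 - R12 T1.

import numpy as np

def relative_extrinsics(R1, T1, R2, T2):
    """Pose of camera 2 relative to camera 1, given world-to-camera (R, T) pairs."""
    R12 = R2 @ R1.T
    T12 = T2 - R12 @ T1
    return R12, T12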

turtlebot1 commented 4 months ago

@gpastal24, Yes, I am using the exact camera params from the Shelf dataset.

I did not calibrate them with a common reference point, but I will do that right now. Previously, I found the extrinsics ((R_c1, T_c1), (R_c2, T_c2), (R_c3, T_c3)) for all three cameras individually. Then I took (R_c1, T_c1) to be (R_w, T_w) and, based on the distances and rotation angles, arrived at R_w,c2 = R_w (R_c1_c2 R_c2), and similarly for R_w,c3. I understand this is a longer route compared to what you have mentioned, but let me do it the way you suggested (calibrating all cameras with a common point) and see how the results turn out. Thank you again for the information.

gpastal24 commented 4 months ago

@turtlebot1 Ok try that.

Regarding the Shelf frames, did you change the space dims and space center? I can't think of anything else.

turtlebot1 commented 4 months ago

@gpastal24 I did not change the space dimensions or the space center, but the configuration file for Shelf was downloaded from 'microsoft/VoxelPose/' and not 'Faster-VoxelPose/'. Other than that, everything else is the same.

gpastal24 commented 4 months ago

@turtlebot1

Hmm OK, I think you love torturing yourself :P

turtlebot1 commented 4 months ago

@gpastal24 Ha ha ha, at this point I would do anything to decrease the false positives and reduce the latency of this pose estimation algorithm. I haven't found anything as good as the work done here in VoxelPose, so I just have to make it work for my research; there's no other option, lol. Thank you though, you've been really helpful right from the beginning.

gpastal24 commented 4 months ago

@turtlebot1

Yes, there aren't any other works that match the accuracy and inference speed of this method. TEMPO might be more accurate, but it is extremely slow overall (I don't know if it has to do with mmcv, but still).

turtlebot1 commented 4 months ago

@gpastal24 Have you tried TEMPO? That was supposed to be my backup plan in case VoxelPose doesn't work for my use case. As per their paper, they report a similar FPS (29) to Faster-VoxelPose (31). I didn't know it was extremely slow. I can imagine why they're slow: the network is designed not only for pose estimation but also for forecasting, and they have a filter running in parallel to the voxel detection.

gpastal24 commented 4 months ago

@turtlebot1

Yes, I tried it; it ran at 2 FPS. I don't know if I did something wrong, but it's unlikely. They measure only the 3D part of their network, I guess.

Plus, their results at AP25 are at 81%, which is weird.

turtlebot1 commented 4 months ago

@gpastal24 I calibrated multiple cameras pointed at a common reference image (a TV displaying the checkerboard pattern) and was able to get the required output with just some misclassifications. I did have to make some changes to the translation vector of one of the cameras in order to get a perfect overlay of the pose estimates on the image. Thank you for the information you have provided.

gpastal24 commented 4 months ago

@turtlebot1 That's great ! Happy testing I guess :smile: !

turtlebot1 commented 4 months ago

@gpastal24 Ha ha, thank you for all your help! 😅

i-AMgroot7 commented 4 months ago

@gpastal24 @turtlebot1 ,

I am using a setup similar to what you have discussed above. By calibrating the cameras (3 in my case) using a single reference point (a checkerboard), each camera ends up with its own world reference plane. Is this what you are getting as well?

I am using the 'cv2.calibrateCamera()' function and then converting the rotation vectors to matrices using the Rodrigues method. Please let me know if I am doing it the right way.

gpastal24 commented 3 months ago

Hi @i-AMgroot7 .

It is better to have a common reference point and calibrate the cameras with respect to that point.

turtlebot1 commented 3 months ago

@i-AMgroot7 I used the solvePnP method to find the rotation and translation vectors and then used Rodrigues to find the rotation matrix.