AlvinYH / Faster-VoxelPose

Official implementation of Faster VoxelPose: Real-time 3D Human Pose Estimation by Orthographic Projection

How to define the axes in my own scenario #21

Closed gpastal24 closed 1 year ago

gpastal24 commented 1 year ago

Hello, I am trying to use this model in a real-world setting. Could you please explain how the axes should be defined during calibration? (I have them as x, z, y, with z being the height dimension.) It appears that each camera predicts its own pose, and the poses end up in completely different positions. Nonetheless, the projection back to the camera works OK for each respective camera if the prediction comes from that one.

1124792823 commented 10 months ago

Hello,

I've encountered the same issue as you. The individual cameras seem to predict their own poses, resulting in completely different positions. However, I also noticed that the back-projection to each respective camera works fine when the prediction originates from that specific camera. Have you found a solution or workaround for this problem?

gpastal24 commented 10 months ago

Yes. If I remember correctly, I had to use Rw2c (world-to-camera) and Tw2c. These are the matrices OpenCV's solvePnP outputs if you calibrate your cameras yourself. I also rotate the rotation matrix by multiplying with the M matrix in the panoptic.py file, so that height becomes the third dimension (I had it as the second when calibrating). Lastly, make sure you define the space center and space size correctly: I used a 4x4 capture space, for example, and my origin point was close to one of the cameras. If you use a 4x4 space, you should also change the number of voxels for the space accordingly (from 80 to 40).
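For reference, here is a minimal sketch (my own, not code from the repo) of getting world-to-camera extrinsics with OpenCV, assuming the pattern's 3D world points and their detected pixel locations are already known:

import cv2

def world_to_cam_extrinsics(object_points, image_points, K, dist):
    """object_points: (N, 3) pattern points in your world frame,
    image_points: (N, 1, 2) detected pixel coordinates,
    K: (3, 3) intrinsics, dist: distortion coefficients."""
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
    assert ok, 'solvePnP failed'
    R_w2c, _ = cv2.Rodrigues(rvec)   # world-to-camera rotation (3x3)
    T_w2c = tvec.reshape(3)          # world-to-camera translation
    return R_w2c, T_w2c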

gpastal24 commented 10 months ago

If you haven't calibrated your cameras yourself and you have Rc, Tc, you can first find Rw (Rc^-1) and then Tw from the equation Tc = -Rw Tw, if I remember correctly off the top of my head. So Tw would be -Rc Tc (I think).
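For completeness, the standard inversion of a rigid transform looks like the sketch below (generic, independent of which convention your calibration tool uses):

import numpy as np

def invert_extrinsics(R, T):
    """Invert a rigid transform (R, T): if x' = R @ x + T, then x = R.T @ x' - R.T @ T."""
    R_inv = R.T
    T_inv = -R_inv @ T
    return R_inv, T_inv

# e.g. if (Rc, Tc) map camera coordinates to world coordinates, then
# invert_extrinsics(Rc, Tc) yields the world-to-camera extrinsics (Rw2c, Tw2c).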

1124792823 commented 9 months ago

Hello, thank you very much for your previous response. I greatly appreciate the time and effort you put into helping me. I'm reaching out to seek some advice on using Faster-VoxelPose, particularly regarding camera calibration and testing my own image data. As a beginner in this area, I am planning to set up cameras at my own site. I understand that obtaining camera calibration data is a crucial step in this process, but I'm not entirely sure how to go about it. Could you provide some guidance on how to acquire this calibration data? Additionally, I am curious whether it's possible to visualize the results in video form, and whether you have any experience testing your own image data with Faster-VoxelPose. Any insights or suggestions you can offer would be greatly appreciated. ❤

Thank you very much for your time and assistance.

gpastal24 commented 9 months ago

@1124792823 Hello, I have tested the model with a live camera feed. The cameras were calibrated (intrinsic and extrinsic parameters known). I use a pattern like this one: link. An example of how to calibrate the camera is provided in the following link.

Basically, you need to map each point that is detected in the pattern from the camera's point of view to the world coordinate system that you will define (where your (0,0,0) will be). In practice, you only have to measure the 4 corners of the pattern, since the distances between the points are fixed (measure them with a ruler; they depend on your pattern size).

So you create a script that takes the 4 corners in 3D space as input and detects the points on the pattern. Then you map the detected points to world coordinates with a function that you write (I believe, for example, every 4 points you add a length unit on the x-axis); a sketch of this kind of script is given below.
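Something along these lines, as a sketch with made-up grid size and spacing; it assumes a chessboard-style pattern (use whatever detector matches your pattern) and that the ordering of the detected corners matches the generated grid.

import cv2
import numpy as np

def pattern_correspondences(image, grid=(11, 4), spacing=0.25, corner_origin=(1.0, 2.0, 0.0)):
    """Detect the pattern in `image` and build the matching 3D points in the world frame,
    starting from the measured position of the first corner. The pattern is assumed to lie
    flat (constant z) with `spacing` metres between neighbouring points."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    found, image_points = cv2.findChessboardCorners(gray, grid)
    assert found, 'pattern not detected'

    cols, rows = grid
    xs, ys = np.meshgrid(np.arange(cols), np.arange(rows))     # grid indices, row by row
    object_points = np.stack([xs, ys, np.zeros_like(xs)], -1).reshape(-1, 3) * spacing
    object_points = object_points.astype(np.float64) + np.asarray(corner_origin)
    return object_points, image_points   # these are the inputs for cv2.solvePnP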

You can visualize the results, both the projections and the 3D data. I created a function to convert the output of the model to the COCO keypoint format (17, 3) and projected the keypoints onto the corresponding images.
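For the projection step, a sketch (not the commenter's actual function; the joint units and order must match your calibration and the model's output):

import cv2
import numpy as np

def draw_pose(image, joints_3d, R_w2c, T_w2c, K, dist):
    """Project (17, 3) world-space joints onto one camera image and draw them."""
    rvec, _ = cv2.Rodrigues(R_w2c)
    pts, _ = cv2.projectPoints(joints_3d.astype(np.float64), rvec,
                               T_w2c.astype(np.float64), K, dist)
    for x, y in pts.reshape(-1, 2):
        cv2.circle(image, (int(x), int(y)), 3, (0, 255, 0), -1)
    return image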

You will also need to create a new yaml file with your own space configuration; in my case it was 4x4x2. The 4x4 is the square the cameras were capturing: let's say I had 3 cameras at (0.0, 0.0, Z1), (4.0, 0.0, Z2), (0.0, 4.0, Z3). I had to change the voxels per axis as well, from 80x80x20 to 40x40x20 (so that the space is discretized the same way as the one the model was trained on). If you can, try it in a bigger space like 8x8x2, as it was trained on; I believe you'll get better results, although from my experience they are pretty OK, especially for 1 person. I guess using more cameras in this small space would also help with more people, but I still think the fact that the cubes to be analyzed are smaller than in the space config it was trained on plays a role (from 80x80 HW down to 40x40).
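For the space settings, the yaml ends up looking roughly like the sketch below. The key names and units here are from memory and may differ between versions, so mirror an existing file under configs/ rather than copying this verbatim.

CAPTURE_SPEC:
  SPACE_SIZE: [4000.0, 4000.0, 2000.0]     # the 4 x 4 x 2 m area, in millimetres
  SPACE_CENTER: [2000.0, 2000.0, 1000.0]   # centre of that area in your world frame
  VOXELS_PER_AXIS: [40, 40, 20]            # halved from 80x80x20 to keep the voxel size similar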

1124792823 commented 9 months ago
Thank you so much for your detailed and informative response. I truly appreciate the time and effort you took to explain the process, especially the practical aspects of camera calibration using a pattern and mapping points from the camera's point of view to the world coordinate system.

Your guidance on creating a script for the 3D space mapping and the advice on adjusting the voxel space configuration for different camera setups are particularly helpful. Also, your insights into the results of the model, especially in relation to the number of cameras used and the size of the space, are invaluable. I am encouraged by your positive experience with the model, especially in a one-person scenario, and I'm looking forward to applying these learnings to my project. Once again, thank you for your assistance and for sharing your experience. It has been incredibly helpful.


turtlebot1 commented 5 months ago

@gpastal24 Hello, thank you for the awesome explanation of the transformation. One question I have is how/where we can define the world coordinate frame so that we can calculate the rotation matrix of each camera. Secondly, I saw that you tested this system with a live camera feed; may I know if you trained the network for that particular scenario, or did you use the pretrained Panoptic network? If you are using the pretrained network, may I know how accurate the predictions are and how many frames per second it runs at? Please let me know, as it would be really helpful for my research.

@1124792823 May I know if you were able to get this model running for your own video feed?

gpastal24 commented 5 months ago

@turtlebot1 You can set any point in space as your origin and calibrate the cameras with respect to it. Then you have to define the space size and center in the config file. For testing I used a model trained on Panoptic, since it is more convenient and faster than running Mask R-CNN (or any other detector) plus top-down 2D pose estimation, as in Shelf and Campus, on many cameras.

The 2D backbone used for Panoptic does not require any post-processing either. I had trained the network from scratch with the old version of the code, since the weights were not available then.

The processing (2D heatmaps + 3D joints) ran at around 40 FPS for 3 cameras on an RTX 3080 (I converted the backbone and the CNN parts of the 3D network to TensorRT). If you don't convert the network, you will get close to 20 FPS for 3 cameras. The pretrained network will naturally lose some accuracy in other scenes, and using 3 cameras instead of 5 will further degrade its performance. You may see more false positives with many people in the scene, but overall the projections back to the cameras are accurate for the most part.
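For what it's worth, one possible route to that kind of TensorRT speed-up is the torch2trt converter; the sketch below is generic (model.backbone is a placeholder attribute name, not necessarily how the repo structures its modules).

import torch
from torch2trt import torch2trt  # https://github.com/NVIDIA-AI-IOT/torch2trt

backbone = model.backbone.cuda().eval()        # the 2D heatmap backbone (placeholder name)
example = torch.randn(1, 3, 512, 960).cuda()   # dummy input matching your image resolution
backbone_trt = torch2trt(backbone, [example], fp16_mode=True)
torch.save(backbone_trt.state_dict(), 'backbone_trt.pth')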

turtlebot1 commented 5 months ago

@gpastal24, Thank you very much for your reply and for the information that I can calibrate the cameras with any point as my origin. Currently, I am testing images the way the authors do in the visualize.ipynb file: I take images from the different cameras at a particular time instant and use them as input to the model, and the processing is only around 12 FPS on an RTX 4080. Do you think I can improve this by converting the network? Another question I have is: how do I measure the space size and space center of my observation space? While using a single camera and playing around with the parameters, the 2D pose estimation on the images behaves perfectly, but the 3D projections of those points come out in weird shapes. May I know how I can find a way out of this? Any input you could give would be really helpful, and I should say your comments in this thread have been really informative. Thank you in advance.

gpastal24 commented 5 months ago

@turtlebot1 I can't help you with the use of 1 camera, to be honest; I have never tried that. Plus, I don't know how you should define the space size when you use only 1 camera. You could try training the model from scratch on Panoptic, but I don't know if this would work. The authors of VoxelPose (not Faster VoxelPose) did report results with 1 camera, though, so I guess it might work. I would advise you to use at least 3 cameras.

Regarding the reprojection results, they are either caused by incorrect calibration, or the model is not able to produce accurate 3D estimates, since it was not trained for the monocular case.

Yes, if you convert the model to TensorRT you should expect roughly double the throughput.

turtlebot1 commented 5 months ago

@gpastal24, Thank you again for your detailed reply; I truly appreciate your time and effort. Considering that the model has been trained by the authors using multi-camera views, I have decided to go with a stereo vision setup for pose estimation. Using the link you previously shared, I am able to get the intrinsic parameters of both cameras, but not the camera extrinsics (rotation matrix and translation vector) for each camera. May I know what world coordinate frame is assumed in this project, so that I can calculate the extrinsics of each camera with respect to the world frame? In the case of stereo calibration, could you help me with how to get the space center and space size properly? Apologies for the consecutive replies; I am new to this area, which is why I have so many questions, and I have found your responses really helpful.

gpastal24 commented 5 months ago

@turtlebot1

When you say stereo vision setup, do you mean you have two lenses in a single device? What I mean by using 3 or more cameras is to have them far apart from each other, for example in a triangle formation. After calibrating the cameras you will have the rotation matrices and the translation vectors. For simplicity, let's say they are at locations (0,0,Z1), (X,0,Z2), (0,Y,Z3). The capture space in this case would be X, Y, 2 (you won't have to change the height parameter unless you do something extreme), and the space center would be (X/2, Y/2, 1).

You can use the right-hand rule to define your system. Pick a point on a wall or something and measure the distances of the 4 corners of the pattern from this point. Then, since you know the distances between the points of the pattern, you can obtain the extrinsics as well.

turtlebot1 commented 5 months ago

@gpastal24, Thank you for your time and support; I really appreciate it. By stereo vision I meant using two cameras that are a known distance apart and focused on the same object/person. I understood how to define the capture space and space center from your explanation. The major issue I am facing is with the rotation matrix and translation vectors. I am able to get the camera intrinsics (fx, fy, cx, cy, p and k) from the link you shared above, and I am sorry to ask this again, but I am not completely sure how to calibrate the extrinsics, that is, how to obtain a set of 3D points in the world coordinate system and their corresponding 2D projections in the image. Could you share some information or literature on how to get the camera extrinsics for the triangular camera arrangement with respect to the world coordinate frame (I am also not sure what the world coordinate frame is in this case, as it is not defined in the literature)? Anything you could share would be really helpful. Thank you once again for all your help!

gpastal24 commented 5 months ago

@turtlebot1 You can calibrate each camera on its own, so you won't mess things up. The world coordinate frame is the origin point (0,0,0) that you define wherever you want it to be. Then you estimate the rotation and translation from the 2D-to-3D correspondences: the 2D points would be the 44 points of the pattern (could be more, could be less, depending on the pattern you are using) in the camera frame, and the 3D points the positions of the same points in the world coordinate system, which is defined by your origin point.

turtlebot1 commented 5 months ago

@gpastal24 Thank you once again for your guidance, and apologies for disturbing you again. I followed the experimental procedure as described by you and successfully obtained output for custom camera frames. However, I've encountered an issue where the coordinate frames seem to be misaligned: what is described as the x-y plane in your work appears as the x-z plane in my setup, and so on. Could you possibly shed some light on why this discrepancy might be occurring? Additionally, would it be possible to share the code where you made the changes for taking in custom video frames and processing them? I am quite lost at the moment, not knowing where I could possibly be going wrong. Any information you can give would be really helpful. Thank you once again.

gpastal24 commented 5 months ago

@turtlebot1 You might have to rotate the axes as in the following lines; your M matrix can be different from the one used there.

https://github.com/AlvinYH/Faster-VoxelPose/blob/733aa9856f277498e4be40cd97105138c90b2ca8/lib/dataset/panoptic.py#L171-L205

When I have some free time, I will try to attach some code snippets.
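In the meantime, the general idea is sketched below; the exact M in panoptic.py may differ, so pick the permutation/rotation (and signs) that match how your own axes were defined.

import numpy as np

# A 90-degree rotation about the x-axis that exchanges the y and z axes; which signs
# you need depends on how your calibration frame and the model's frame are oriented.
M = np.array([[1.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

# For world-to-camera extrinsics (x_cam = R @ x_world + T), changing the world axes
# only affects R; the translation stays the same as long as the origin is unchanged.
def rotate_world_axes(R_w2c):
    return R_w2c @ M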

turtlebot1 commented 5 months ago

@gpastal24 Thank you for your assistance and patience. Using the information you have provided, I was able to get the model working.

gpastal24 commented 5 months ago

@turtlebot1 That's good to hear. How was the performance?

turtlebot1 commented 4 months ago

@gpastal24 I am still working on improving the performance. I have not seen a lot of misclassifications, but the model has high latency (with 2 cameras it processes at about 5 FPS). Previously, I was not following the approach the authors take in the validate.py file, i.e.

test_dataset = eval('dataset.' + config.DATASET.TEST_DATASET)(
    config, False,
    transforms.Compose([
        transforms.ToTensor(),
        normalize,
    ]))

test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=config.TEST.BATCH_SIZE,
    shuffle=False,
    num_workers=config.WORKERS,
    pin_memory=True)

but rather I am doing it the way the authors do in the visualize.ipynb file. That is, I am grabbing images from each camera (for synchronized timeframes), converting them to tensors, and feeding them to the model. I believe this is the reason for the higher latency, and I am now switching to the former type of data processing. May I know if your processing time is faster when done the former way? Please let me know. Thank you.
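For reference, the per-frame path looks roughly like the sketch below (my own assumptions about input size and normalization; the repo's actual preprocessing may differ, e.g. it may use an affine warp rather than a plain resize).

import cv2
import torch
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
to_tensor = transforms.Compose([transforms.ToTensor(), normalize])

def frames_to_input(frames, size=(960, 512), device='cuda'):
    """frames: one BGR image per camera, captured at the same time instant."""
    views = []
    for img in frames:
        img = cv2.resize(img, size)                  # size is (width, height)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # OpenCV loads images as BGR
        views.append(to_tensor(img))
    # shape (1, num_views, 3, H, W): a batch of one multi-view sample
    return torch.stack(views).unsqueeze(0).to(device, non_blocking=True)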

turtlebot1 commented 4 months ago

@gpastal24 I had one more question. I am able to plot the estimated keypoints in 3D and 2D perfectly, but the pose overlay on the image is not perfect (that is, the pose estimates do not correctly align with the person in the image). May I know what could be the reason for this issue? Kindly let me know whenever possible. Thank you in advance.

gpastal24 commented 4 months ago

@turtlebot1 Hello, I didn't quite understand your issue. Are you referring to some false positives? The projections of false positives will probably not align with any person.

turtlebot1 commented 4 months ago

@gpastal24, The image I have attached below can explain the issue I am facing. I am getting the pose estimates properly in the coordinate planes, but in the case of 'images with poses' (i.e. superimposing the pose estimates over the images), they are not aligning with the actual humans and are offset a little. May I know what could be causing this? Thank you for your time and patience, I really appreciate it!

Screenshot from 2024-04-24 12-01-06

turtlebot1 commented 4 months ago

@gpastal24, Another question I had is the following. I have 3 cameras: Camera_1 is at (0,0,Z); Camera_2 is at (4,0,Z) and at an angle of -90 degrees with respect to Camera_1; Camera_3 is at (0,4,Z) and at an angle of 45 degrees with respect to Camera_1. I calibrate all the cameras individually using the OpenCV toolbox and a checkerboard pattern. The code I used for calibration can be found in this link. One question I have about this method (please correct me if I am wrong): if we are using one camera as a reference, we should rotate and translate the other cameras with respect to the reference camera as well. In that case, how do we rotate the other cameras with respect to the reference camera? Please let me know whenever possible. Thank you again for your kind consideration.

gpastal24 commented 4 months ago

I can't answer with certainty, but it appears to me that something is wrong with the calibration parameters in the image. It is weird, since these are frames from Shelf. Are you using the camera params from the Shelf dataset, or what?

Did you calibrate them with a common reference point or not? If you calibrated them with a common reference point, you don't have to do anything more, apart from maybe rotating them with the M matrix as discussed previously. If you calibrated each camera with its own reference point (its optical center, I suppose), it is a bit trickier. If you are certain about the rotations and positions, you can estimate their R and T relative to the reference camera with a math formula that I can't seem to find right now (I will look for it a bit more). I suppose, though, that you can directly calibrate them with reference to the 1st camera by measuring the distance of the pattern's corners from the camera sensor.

Of course this is a bit more challenging than selecting an arbitrary point (how can you be sure you are measuring from the optical center?).
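If it helps, the relation alluded to above is, I believe, the standard one for chaining world-to-camera extrinsics (stated here as an assumption, since it is not given in the thread): with x_c1 = R1 x_w + T1 and x_c2 = R2 x_w + T2, the pose of camera 2 relative to camera 1 is R12 = R2 R1^T and T12 = T2 - R12 T1.

import numpy as np

def relative_extrinsics(R1, T1, R2, T2):
    """Pose of camera 2 relative to camera 1, given world-to-camera (R, T) pairs."""
    R12 = R2 @ R1.T
    T12 = T2 - R12 @ T1
    return R12, T12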

turtlebot1 commented 4 months ago

@gpastal24, Yes, I am using the exact camera params from the Shelf dataset.

I did not calibrate them with a common reference point, but I will do that right now. Previously, I found the extrinsics ((R_c1, T_c1), (R_c2, T_c2), (R_c3, T_c3)) for all three cameras individually. Then I took (R_c1, T_c1) to be (R_w, T_w) and, based on the distances and rotation angles, arrived at R_w,c2 = R_w (R_c1_c2 R_c2), and similarly for R_w,c3. I understand this is a longer route compared to what you have mentioned, but let me do it the way you suggested (calibrating all cameras with a common point) and see how the results turn out. Thank you again for the information.

gpastal24 commented 4 months ago

@turtlebot1 Ok try that.

Regarding the Shelf frames, did you change the space dims and space center? I can't think of anything else.

turtlebot1 commented 4 months ago

@gpastal24 I did not change the space dimensions or the space center, but the configuration file for Shelf was downloaded from 'microsoft/VoxelPose/' and not 'Faster-VoxelPose/'. Other than that, everything else is the same.

gpastal24 commented 4 months ago

@turtlebot1

Hmm OK, I think you love torturing yourself :P

turtlebot1 commented 4 months ago

@gpastal24 Ha ha ha, at this point I would do anything to decrease the false positives and reduce the latency of this pose estimation algorithm. I haven't found anything as good as the work done here in VoxelPose, so I just have to make it work for my research; there's no other option, lol. Thank you though, you've been really helpful right from the beginning.

gpastal24 commented 4 months ago

@turtlebot1

Yes, there aren't any other works that match the accuracy and inference speed of this method. TEMPO might be more accurate, but it is extremely slow overall (I don't know if it has to do with mmcv, but still).

turtlebot1 commented 4 months ago

@gpastal24 Have you tried TEMPO? That was supposed to be my backup plan in case VoxelPose doesn't work for my use case. As per their paper, they report a similar FPS (29) to Faster-VoxelPose (31). I didn't know it was extremely slow. I can imagine why they're slow: the network is designed not only for pose estimation but also for forecasting, and they have a filter running in parallel to the voxel detection.

gpastal24 commented 4 months ago

@turtlebot1

Yes, I tried it; it ran at 2 FPS. I don't know if I did something wrong, but it's unlikely. They measure only the 3D part of their network, I guess.

Plus, their results at AP25 are at 81%, which is weird.

turtlebot1 commented 4 months ago

@gpastal24 I calibrated multiple cameras pointed at a common reference image (a TV displaying the checkerboard pattern) and was able to get the required output with just some misclassifications. I did have to make some changes to the translation vector of one of the cameras in order to get a perfect overlay of the pose estimates on the image. Thank you for the information you have provided.

gpastal24 commented 4 months ago

@turtlebot1 That's great ! Happy testing I guess :smile: !

turtlebot1 commented 4 months ago

@gpastal24 Ha ha, thank you for all your help! 😅

i-AMgroot7 commented 4 months ago

@gpastal24 @turtlebot1 ,

I am using a setup similar to what you have discussed above. By calibrating the cameras (3 in my case) using a single reference point (a checkerboard), each camera ends up with its own world reference plane. Is this what you are getting as well?

I am using the 'cv2.calibrateCamera()' function and then converting the rotation vectors to matrices using the Rodrigues method. Please let me know if I am doing it the right way.

gpastal24 commented 3 months ago

Hi @i-AMgroot7 .

It is better to have a common reference point and calibrate the cameras with respect to that point.

turtlebot1 commented 3 months ago

@i-AMgroot7 I used the solvePnP method to find the rotation and translation vectors and then used Rodrigues to find the rotation matrix.