JARVIS-MoCap / JARVIS-AnnotationTool

AnnotationTool to create multi-view annotations for the JARVIS 3D Markerless Pose Estimation Toolbox
https://jarvis-mocap.github.io/jarvis-docs/
GNU Lesser General Public License v2.1

Suggestions for Sparse projection on multiple cameras #8

Open KenR22 opened 1 year ago

KenR22 commented 1 year ago

Hi Timo,

Currently our system projects the subject on half of the cameras (4 out of 9) at a time, and I have some queries regarding this. We did a first pass of all the models in the HybridNet and the results do not look very good.

  1. We have a lot of frames that are unlabeled because the subject is not in them. Could this be biasing the model? Should we suppress the keypoints instead of leaving them unlabeled to mitigate this issue? Or should we make a different 2D dataset with fewer unlabeled frames without the subject?
  2. We have a freely moving subject that projects onto half of the cameras. Do you have a ballpark estimate of how many frames we should label, or at what frequency we should sample segments/frames?
  3. We have five 1280x720 cameras and three 640x360 cameras. Right now I am downsampling the 720p videos to 480p (keeping the aspect ratio) and padding the 360p cameras with black (see the sketch below). How do you think we should set up the videos? Downsample the 720p videos to 360p, or keep it as it is?
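
For reference, this is roughly what my per-frame preprocessing looks like. A minimal sketch with OpenCV; the exact target size, padding side, and interpolation are just my choices, not anything Jarvis prescribes:

```python
import cv2

TARGET_W, TARGET_H = 854, 480  # ~1280x720 scaled to 480p, aspect ratio kept up to rounding

def preprocess(frame):
    """Bring every camera to the same canvas size: downscale the 720p
    frames, pad the smaller 360p frames with black instead of resizing."""
    h, w = frame.shape[:2]
    if (w, h) == (1280, 720):
        # Downscale the 720p cameras to ~480p.
        return cv2.resize(frame, (TARGET_W, TARGET_H),
                          interpolation=cv2.INTER_AREA)
    # 640x360 cameras: keep the pixels as they are and pad right/bottom with black.
    pad_right, pad_bottom = TARGET_W - w, TARGET_H - h
    return cv2.copyMakeBorder(frame, 0, pad_bottom, 0, pad_right,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
```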

Thanks again for your help and this great tool!

-Ken

JARVIS-MoCap commented 1 year ago

Hi Ken,

  1. It's not a problem to have unlabeled frames. The network ignores them during training, so they should not bias the model. You don't have to suppress them; that feature exists to get rid of reprojected keypoints that you don't want as part of your training set. (For example, if your subject is completely hidden behind a wall and the reprojected points all land on the wall, you don't want the network to be trained to predict the entirely occluded subject.)

  2. I would estimate you need at the very least around 500 labeled frames (roughly 100 frame-sets in your case) to start getting decent results. For accurate tracking, around 1,000 frames are usually needed. This really depends on the setup and the complexity of the movement though, so it's only a very rough estimate. How many do you have labeled currently?

  3. I think the approach you are currently taking is better than downsampling everything to 360p. If anything, I would rather consider upsampling the 360p videos by a factor of two; you could even try playing around with one of the AI upsampling tools that are fairly common these days. I had mixed results with that on other datasets, but it might be worth a shot. (Keep in mind that you'll have to do the same to the calibration videos in that case, of course; see the sketch below.)
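
If you go the plain 2x upsampling route, bicubic upscaling with OpenCV is enough to try it out. Just a sketch; the codec and interpolation choices are mine:

```python
import cv2

def upscale_video_2x(in_path, out_path):
    """Write a 2x upsampled copy of a video (e.g. 640x360 -> 1280x720).
    Run this on the recordings *and* on the matching calibration videos,
    since the calibration has to be done at the same resolution."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) * 2
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) * 2
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, (w, h), interpolation=cv2.INTER_CUBIC))
    cap.release()
    writer.release()
```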

One additional troubleshooting step you could take is to let the network create 2D predictions for just a single camera view. If that works well but the 3D predictions look bad, the issue is more likely the configuration of your 3D network rather than the amount of labeled data.

If you don't mind, you can also share the training set you created with me (thueser@dpz.eu is my email) and I can have a look to see if there is some configuration that needs to be tweaked, or give you other suggestions if I notice something that can be improved.

Hope this helps and best, Timo

KenR22 commented 1 year ago

Let me try bumping up the number of labeled frames we have, and I will reach out with our training data later on if that does not work.

Thanks again!

JARVIS-MoCap commented 1 year ago

Sounds good! Please let me know if more annotations improve your results! :)

KenR22 commented 1 year ago

Hi Timo,

Just wanted to let you know that more annotations helped increase the 2D accuracy, and 3D is somewhat working.

One part of our subject is detected well, but other parts are not detected as well. This leads me to the following questions:

  1. Do you have an overview of the HybridNet parameters somewhere? I am trying to fine-tune them and am confused by ROI_CUBE_SIZE, GRID_SPACING, etc.
  2. What are the entities in the AnnotationTool? Would using them improve or change the model somehow?
  3. Is there any way to suppress points below a confidence threshold when plotting videos, or to do temporal filtering in Jarvis?

Thanks again!

KenR22 commented 1 year ago

Hi,

Reaching out again with specific questions.

  1. Do you have an overview of the HybridNet parameters somewhere? I am trying to fine-tune them and am confused by ROI_CUBE_SIZE, GRID_SPACING, etc.
  2. We have a square space where our subjects work in one corner at a time. Do you think Jarvis can handle it if the subject is not visible in half of the cameras at a time (only 3 cameras)?

Thanks

timohueser commented 1 year ago

Hi Ken,

Sorry for the late reply!

  1. I will put up a manual page that describes all the config parameters and let you know once it's up.
  2. Entities in the AnnotationTool don't really have a use right now. When I designed it I thought they could be used for annotating multiple different subjects (like human and monkey) at once, but there is no such functionality in any other part of Jarvis at the moment.
  3. There is no built-in way to do temporal filtering at the moment. You can of course filter the predictions yourself and replace the data3D.csv file with a filtered version before creating the videos (see the sketch below this list). I will look into adding a confidence threshold to the video creation; that's a very useful feature and should be easy to add, since we're calculating the confidence already anyway.

Regarding the second question in your follow-up: as long as at least 2 or 3 cameras see your subject at any given time it should work. Obviously performance will drop compared to the parts of the video where all cameras can see your subject.
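
If you want to try the filtering yourself before any of that is built in, the idea is just to post-process the CSV. A rough sketch with pandas/SciPy; note that I'm assuming the keypoint coordinates sit in plain numeric columns, so adapt the column selection to however your data3D.csv is actually laid out:

```python
import numpy as np
import pandas as pd
from scipy.signal import medfilt

# NOTE: the exact column layout of data3D.csv is an assumption here;
# adjust the column selection to match your file.
df = pd.read_csv("data3D.csv")
coord_cols = df.select_dtypes(include=[np.number]).columns

# Simple temporal smoothing: a 5-frame median filter on every numeric column.
for col in coord_cols:
    df[col] = medfilt(df[col].to_numpy(dtype=float), kernel_size=5)

df.to_csv("data3D_filtered.csv", index=False)
# Point the video creation at the filtered file (or overwrite data3D.csv with it).
```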

Hope this helps, but please feel free to ask more questions if anything is still unclear!

Best, Timo

timohueser commented 1 year ago

Hi again,

here's a link to the manual page that describes the config parameters: https://jarvis-mocap.github.io/jarvis-docs/manual/6_training_hybridnet/

Hope this helps! :)

KenR22 commented 1 year ago

This is great! I will look into it and let you know if we can get it running. Thanks!