chico2121 / bpe


Some questions about inference_single_pair_visuals #11

Closed feiniao66 closed 2 years ago

feiniao66 commented 2 years ago

Among the four parameters, --video is the path to the action video and -v is the COCO-style keypoint sequence for that video; these four parameters are all required, right?

BAILOOL commented 2 years ago

Indeed. This script takes two videos along with COCO-style human keypoint annotations for their duration and produces an output video with a motion similarity score for every N frames. Technically you don't even need the videos; the keypoint annotations alone are enough. The videos are only used for visualisation purposes, while the keypoint sequences are what is used for the motion similarity estimation.

Unfortunately, we used a private dataset for this purpose, so we cannot release it. If it is essential for you to produce video output, there are two options: 1) find publicly available videos with COCO-style annotations; 2) use a human pose estimation network to create annotations for whatever video you prefer.
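For option 2, a rough sketch of how one might dump per-frame COCO-style keypoints from a video. The `estimate_keypoints` helper is hypothetical and stands in for whichever pose estimator you pick; the output layout is an assumption, not the exact format this repo expects:

```python
import json
import cv2  # OpenCV, used only to read frames from the video


def estimate_keypoints(frame):
    """Placeholder for a real pose estimator (e.g. a COCO-pretrained
    keypoint model). Should return 17 (x, y, visibility) triplets for
    the single person in the frame."""
    raise NotImplementedError


def video_to_coco_keypoints(video_path, out_json):
    cap = cv2.VideoCapture(video_path)
    annotations = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        keypoints = estimate_keypoints(frame)  # 17 COCO joints
        annotations.append({
            "frame": frame_idx,
            # COCO flattens keypoints to [x1, y1, v1, x2, y2, v2, ...]
            "keypoints": [v for kp in keypoints for v in kp],
        })
        frame_idx += 1
    cap.release()
    with open(out_json, "w") as f:
        json.dump(annotations, f)


# video_to_coco_keypoints("action.mp4", "action_keypoints.json")
```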

feiniao66 commented 2 years ago

I see. So to compare video actions, I need to make the COCO-style annotations myself, for example by annotating the video with something like MultiPoseNet. Then inference_single_pair_visuals generates and displays the similarity of each body part, as shown in the following figure.

BAILOOL commented 2 years ago

Yes, exactly. Feel free to propose code changes; contributions are welcome.

feiniao66 commented 2 years ago

These action sequences come from the video. My understanding is that you collect the joint coordinates from each frame and then put the joint sequence of an action together; for example, if the action video has 60 frames, there are 60 corresponding groups of joint coordinates. Secondly, the COCO annotation drawn on the video, as shown in the figure below, only marks the joint points for visualization and has nothing to do with the similarity results, right? Finally, should the coordinates of the joint sequence be relative coordinates or world coordinates?

BAILOOL commented 2 years ago

1) The video should contain only a single person. You don't need to group frames by action for this algorithm.

2) I believe it would be best to obtain example files containing annotated frames with joints. Let's see if @sanghoon @kimdwkimdw @chico2121 @SukhyunCho can help us with that.
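For illustration, a minimal sketch of the per-frame joint layout discussed above, assuming the standard 17-keypoint COCO ordering; the exact file format the script expects is an assumption until example files are available:

```python
import numpy as np

# One action clip of 60 frames, a single person, 17 COCO keypoints per frame.
# Each keypoint is (x, y) in image pixels; shape: (frames, joints, 2).
num_frames, num_joints = 60, 17
sequence = np.zeros((num_frames, num_joints, 2), dtype=np.float32)

# sequence[t, j] holds the pixel coordinates of joint j at frame t,
# e.g. sequence[0, 0] is the nose position in the first frame
# (joint 0 is "nose" in the standard COCO keypoint ordering).
```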

feiniao66 commented 2 years ago

The video and motion sequences should also be consistent with each other. I see that there is a DTW algorithm in the program; if I prepare the data myself, do I need to align it manually? Secondly, the video format is 1920x1080 at 30 FPS; there shouldn't be any other requirements, right? Finally, I want to label the video action sequence manually, for example with labelme, and then convert it into COCO-format annotations. Is this feasible?

BAILOOL commented 2 years ago

Yes, time alignment is possible, and in fact there is a script argument for it. DTW is only used to align frames within the current sliding window, not the whole video.

The video doesn't have any special requirements.

It is feasible to annotate the frames manually, but perhaps a better option would be to use an open-source human pose estimation network to output predictions in COCO format.
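For reference, a minimal textbook-style DTW sketch of the kind of per-window alignment described above. This is not the repo's actual implementation, and the per-frame distance metric is an assumption:

```python
import numpy as np


def dtw_distance(seq_a, seq_b):
    """Classic DTW between two windows of keypoint frames.

    seq_a, seq_b: arrays of shape (frames, joints, 2). Returns the
    accumulated alignment cost between the two windows.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Per-frame distance: mean Euclidean distance over joints
            # (an assumption; any frame-level metric would work here).
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1], axis=-1).mean()
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]


# Align only the frames inside the current sliding window, not the whole video:
# score = dtw_distance(seq_a[start:start + window], seq_b[start:start + window])
```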

ghost commented 2 years ago

Can you please explain the argument similarity_measurement_window_size in more detail? If I have two oversampled sequences, does it compute the maximum between them? And what if I have only one? Thanks.

BAILOOL commented 2 years ago

@MagnusAnders once we sample shorter sequences from the video using a sliding-window approach, those sequences are fed to the neural network encoder to produce embeddings. The full set of those embeddings constitutes the embedding of the whole video. The parameter similarity_measurement_window_size decides how many of these embeddings to use from both sequences when determining the motion similarity score. I believe that changing this parameter is equivalent to changing the video_sampling_window_size parameter (e.g. doubling the first one is equivalent to doubling the second one).
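As a rough illustration of that idea (the use of cosine similarity and the window-by-window comparison are assumptions for illustration, not the repo's exact scoring code):

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def motion_similarity(emb_seq1, emb_seq2, similarity_measurement_window_size):
    """emb_seq1 / emb_seq2: lists of per-window embeddings produced by the
    encoder for each sampled sliding window (video_sampling_window_size
    frames each). Compares the two sequences window by window and averages
    over similarity_measurement_window_size consecutive embeddings."""
    scores = []
    k = similarity_measurement_window_size
    for start in range(0, min(len(emb_seq1), len(emb_seq2)) - k + 1):
        window_scores = [cosine(emb_seq1[start + i], emb_seq2[start + i])
                         for i in range(k)]
        scores.append(float(np.mean(window_scores)))
    return scores  # one similarity value per measurement window
```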