Indeed. This script takes two videos along with COCO-style human keypoint annotations covering their duration, and produces an output video with a motion similarity score for every N frames. Technically, you don't even need the videos; the keypoint annotations alone are enough. The videos are only used for visualisation purposes, while the keypoint sequences are what the motion similarity estimation actually uses.
Unfortunately, we used a private dataset for this purpose, so we cannot release it. If producing the video output is essential for you, there are two options: 1) find publicly available videos with COCO-style annotations; 2) use a human pose estimation network to create annotations for whatever video you prefer.
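For reference, here is a minimal sketch of what COCO-style per-frame keypoint annotations look like. The 17-joint [x, y, visibility] layout is the standard COCO "person" convention; the surrounding file layout and names below are only illustrative assumptions, not this repo's exact schema.

```python
# Sketch: build COCO-style keypoint annotations for a video, one entry per frame.
import json

COCO_JOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def make_frame_annotation(frame_idx, joints_xy):
    """Pack per-joint (x, y) pixel coordinates into a COCO-style entry."""
    keypoints = []
    for x, y in joints_xy:
        keypoints += [float(x), float(y), 2]   # v=2: labelled and visible
    return {
        "image_id": frame_idx,      # frame index within the video
        "category_id": 1,           # COCO "person" category
        "num_keypoints": len(joints_xy),
        "keypoints": keypoints,     # flat [x1, y1, v1, x2, y2, v2, ...]
    }

# Example: dump annotations for a 60-frame clip (dummy coordinates here).
annotations = [
    make_frame_annotation(i, [(100 + i, 200 + j) for j in range(len(COCO_JOINTS))])
    for i in range(60)
]
with open("my_video_keypoints.json", "w") as f:   # file name is an assumption
    json.dump({"annotations": annotations}, f)
```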
If you want to compare video actions, you need to create the COCO-style video annotations yourself. For example, you can annotate a video using multiposenet. Finally, you can use that information for the single-pair visualization, which generates and displays the similarity of each body part, as shown in the following figure.
Yes exactly. Feel free to propose code changes. Contributions are welcome.
These action sequences come from the video. My understanding is that the joint coordinates from each frame of the video are collected, and then the joint sequences of an action are put together; for example, if the action video has 60 frames, there are 60 corresponding groups of joint coordinates. Secondly, the COCO annotation on the video, as shown in the figure below, just marks the joint points and visualizes them, and has nothing to do with the similarity results. Also, are the coordinates of the joint sequence relative coordinates or world coordinates?
1) the video should contain only a single person. You don't need to group frames by action for this algorithm.
2) I believe it would be best to obtain example files containing annotated frames with joints. Let's see if @sanghoon @kimdwkimdw @chico2121 @SukhyunCho can help us with that.
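In the meantime, here is a rough sketch of how per-frame COCO keypoints could be stacked into a sequence array. The hip-centered normalization shown is only one possible convention for "relative" coordinates, not necessarily what this repo's data-loading code expects.

```python
import numpy as np

def keypoints_to_sequence(frame_annotations):
    """Stack per-frame COCO keypoints into a (num_frames, 17, 2) array.

    Each annotation is expected to carry a flat [x1, y1, v1, x2, y2, v2, ...]
    "keypoints" list, as in the sketch above.
    """
    frames = []
    for ann in frame_annotations:
        kp = np.asarray(ann["keypoints"], dtype=np.float32).reshape(-1, 3)
        frames.append(kp[:, :2])          # drop the visibility flag
    return np.stack(frames, axis=0)

def center_on_hips(sequence, left_hip=11, right_hip=12):
    """Optional: express joints relative to the hip midpoint of each frame.

    This is just one common normalization choice; whether the repo expects
    image (pixel) coordinates or person-centered coordinates should be
    checked against its own code.
    """
    hip_mid = (sequence[:, left_hip] + sequence[:, right_hip]) / 2.0
    return sequence - hip_mid[:, None, :]
```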
The video and motion sequences should also be consistent. I see that there is a DTW algorithm in the program; if I prepare the data myself, do I need to align it manually? Secondly, the video format is 1920 × 1080 at 30 FPS; are there any other necessary requirements? Finally, I want to manually label the video action sequence, for example using labelme, and then convert it into COCO-format annotations. Is this feasible?
Yes, time alignment is possible, and in fact there is a script argument to do so. DTW is only used to align frames within the current sliding window, not the whole video.
The video doesn't have any special requirements.
It is feasible to annotate frames manually, but perhaps a better option would be to use an open-source human pose estimation network to output predictions in COCO format.
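To make the per-window alignment mentioned above concrete, here is a textbook DTW sketch over two short windows of feature vectors (keypoints or embeddings). It illustrates the general technique only and is not the exact code used in this repo.

```python
import numpy as np

def dtw_distance(window_a, window_b):
    """Classic dynamic-time-warping cost between two windows of feature vectors.

    window_a: (Ta, D) array, window_b: (Tb, D) array. Returns the accumulated
    alignment cost; a smaller value means the two windows follow a more
    similar motion, even if their timing differs slightly.
    """
    ta, tb = len(window_a), len(window_b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(window_a[i - 1] - window_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[ta, tb]
```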
Can you please explain the similarity_measurement_window_size argument in more detail? If I have two oversampled sequences, does it compute the maximum between them? And what if I have only one? Thanks
@MagnusAnders once we sample shorter video sequences from the video using a sliding-window approach, the sequences are fed to the neural-network encoder to produce embeddings. The full set of those embeddings constitutes the embedding of the whole video. The similarity_measurement_window_size parameter decides how many of these embeddings to use from both sequences when determining the motion similarity score. I believe that changing this parameter is equivalent to changing the video_sampling_window_size parameter (e.g. doubling the first one is equivalent to doubling the second one).
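As a rough illustration of that windowed comparison, the sketch below scores two embedding sequences chunk by chunk. The cosine-similarity averaging is an assumption for illustration only; the repo's actual scoring function may differ.

```python
import numpy as np

def windowed_similarity(emb_a, emb_b, window_size):
    """Compare two sequences of embeddings in chunks of `window_size`.

    emb_a, emb_b: (N, D) arrays of embeddings produced by the encoder for the
    two videos. Returns one (assumed) cosine-based similarity score per
    measurement window.
    """
    n = min(len(emb_a), len(emb_b))
    scores = []
    for start in range(0, n, window_size):
        end = min(start + window_size, n)
        a = emb_a[start:end]
        b = emb_b[start:end]
        cos = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
        scores.append(float(cos.mean()))   # average similarity in this window
    return scores
```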
Among the four parameters, --video is the path of the action video and -v is the COCO action sequence for the corresponding action video; all four parameters are required, right?