khannabeela opened this issue 12 months ago
@khannabeela Hello, there may have been some misunderstanding in my earlier wording; I have now revised the explanation below.
Hello @SignDiff. Thank you for your hard work. I would like to ask about the preprocessed data used for training the ASL production models. Could you give more information about the skeleton structure and the 3D information you retrieved?
Thank you
@khannabeela
How do we extract keypoints? We extracted two-dimensional (2D) frontal human pose information from videos of different resolutions with OpenPose, covering the upper body and both hands. This includes 8 upper-body keypoints and 21 keypoints per hand, i.e. 42 hand keypoints. Together these two parts give 50 keypoints, each carrying three values, for a total of 150 numbers per frame.
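A minimal sketch of this per-frame extraction (assuming the standard OpenPose JSON output with BODY_25 plus hand keypoints; the exact eight upper-body indices kept here are illustrative, not necessarily the ones used in the paper):

```python
import json
import numpy as np

# Illustrative choice of 8 upper-body joints from OpenPose's BODY_25 layout
# (nose, neck, shoulders, elbows, wrists). Adjust if the paper's subset differs.
UPPER_BODY_IDX = [0, 1, 2, 3, 4, 5, 6, 7]

def load_frame_keypoints(json_path):
    """Read one OpenPose frame JSON and return a (50, 3) array:
    8 upper-body + 21 left-hand + 21 right-hand keypoints."""
    with open(json_path) as f:
        data = json.load(f)
    if not data["people"]:
        return np.zeros((50, 3), dtype=np.float32)  # no person detected in this frame
    person = data["people"][0]

    body = np.array(person["pose_keypoints_2d"], dtype=np.float32).reshape(-1, 3)
    lhand = np.array(person["hand_left_keypoints_2d"], dtype=np.float32).reshape(-1, 3)
    rhand = np.array(person["hand_right_keypoints_2d"], dtype=np.float32).reshape(-1, 3)

    frame = np.concatenate([body[UPPER_BODY_IDX], lhand, rhand], axis=0)
    assert frame.shape == (50, 3)  # 50 keypoints x 3 values = 150 numbers
    return frame
```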
The pipeline then proceeds through the steps "json (2D keypoints) to h5", "h5 to txt (3D keypoints)", and "txt to skels (Standard Pose Storage)":
How do we complete "json to h5"? We read the JSON files in a folder one by one (each file is one frame of pose information: 50 keypoints, 150 numbers) and store all of a folder's frames under one key of an HDF5 (h5) file, read and written as NumPy arrays, with the key named after the folder. Multiple folders produce multiple keys, and together they form a single h5 file.
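A rough sketch of that packing step (it reuses `load_frame_keypoints` from the sketch above; the file-name pattern and key layout are assumptions):

```python
import glob
import os
import h5py
import numpy as np

def folder_to_array(folder):
    """Stack every per-frame JSON in one video folder into a (frames, 150) array."""
    frames = [load_frame_keypoints(p).reshape(-1)
              for p in sorted(glob.glob(os.path.join(folder, "*_keypoints.json")))]
    if not frames:
        return np.zeros((0, 150), dtype=np.float32)
    return np.stack(frames, axis=0)

def build_h5(root_dir, out_path="keypoints_2d.h5"):
    """Create one HDF5 dataset per video folder, keyed by the folder name."""
    with h5py.File(out_path, "w") as h5:
        for name in sorted(os.listdir(root_dir)):
            folder = os.path.join(root_dir, name)
            if os.path.isdir(folder):
                h5.create_dataset(name, data=folder_to_array(folder))
```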
How do we complete "h5 to txt"? We read each key of the h5 file in turn (the original folder names) and create the corresponding folder; each folder gets 5 txt files, where the last one is the result and the first 4 store intermediate variables. This is the 2D-to-3D part, and the key Formula 3 in the paper describes this step. In addition, we read the data and delete invalid entries such as NaN or 0, or replace them with the mean or median of the data. In the end we condensed the data to about 1/5 of its original size; this figure refers to the processing of the ASL portion.
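A small sketch of the cleaning part only (the 2D-to-3D lifting itself follows Formula 3 in the paper and is not reproduced here; the threshold below is illustrative):

```python
import numpy as np

def clean_sequence(seq, min_valid_ratio=0.8):
    """Drop frames that are mostly missing and fill the remaining NaN/0 entries
    with the per-coordinate median. `min_valid_ratio` is an illustrative value."""
    seq = np.asarray(seq, dtype=np.float32)          # shape (frames, 150)
    invalid = np.isnan(seq) | (seq == 0.0)

    keep = invalid.mean(axis=1) < (1.0 - min_valid_ratio)  # frame-level filter
    seq, invalid = seq[keep], invalid[keep]

    med = np.nanmedian(np.where(invalid, np.nan, seq), axis=0)
    med = np.nan_to_num(med)                          # all-missing columns -> 0
    seq[invalid] = np.broadcast_to(med, seq.shape)[invalid]
    return seq
```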
How do we complete "txt to skels"? We read the fifth txt file of each folder in turn; the number of lines in that file equals the number of frames of the corresponding video. We read one line of the txt (150 numbers separated by spaces, i.e. one frame), append a space, then append a counter value (the current line number divided by the total number of lines, which acts as a progress indicator), append another space, then append the next line, and repeat. Each video's txt (containing 151 × number-of-frames numbers in total) thus becomes a single line of content, and in this way tens of thousands of videos are all stored in our standard format.
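A sketch of that flattening step (whether the counter starts at the first or second frame, and its exact formatting, are assumptions here):

```python
def txt_to_skels_line(txt_path):
    """Flatten one video's per-frame txt into a single line: for each frame,
    150 coordinates followed by a progress counter (current line / total lines),
    i.e. 151 * num_frames numbers in total."""
    with open(txt_path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    total = len(lines)
    parts = []
    for i, ln in enumerate(lines):
        parts.append(ln)                        # 150 space-separated numbers
        parts.append(f"{(i + 1) / total:.8f}")  # progress-bar counter
    return " ".join(parts)

def build_skels(txt_paths, out_path="train.skels"):
    """Write one line per video, in corpus order."""
    with open(out_path, "w") as out:
        for p in txt_paths:
            out.write(txt_to_skels_line(p) + "\n")
```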
@SignDiff Thank you for the explanation. I would also like to ask about the recognizer you used for model evaluation. It seems a bit old, and I am not sure about its performance. Do you think that if we used a better recognizer, we might get a better BLEU-4 score?
https://github.com/imatge-upc/slt_how2sign_wicv2023
The performance of that model is good, but it is trained on I3D features. If you can modify its data loader to use our OpenPose-based data, I think that would be very good work. If you modify this model, you could train the first long-video sign language recognition model, and it should attract more attention. Most pose-to-video models and follow-up work are based on OpenPose, not I3D, so if you did this it would be a great boost to the field. We can do it together if you want.
@SignDiff
Thank you for your advice and help. It would be awesome if we could work together on this. One more thing I would like to mention: alongside your OpenPose data, I have preprocessed How2Sign using MediaPipe, and it seems to give good estimates. I would like to compare the data obtained from both and see which one performs better. With MediaPipe, I don't need the 2D-to-3D conversion.
Anyway, I would like to discuss this further with you, if possible. Looking forward to hearing from you soon.
@khannabeela
I once thought about removing the 2D-to-3D step, which would definitely be more accurate. It would be great if How2Sign provided native 3D. I considered training directly on 2D data at the time, but there were many negative impacts on future applications, so I gave up.
If you consider MediaPipe, it may have some advantages in terms of mobile deployment and accuracy. If you generate MediaPipe poses, you need to consider whether there is a pose2vid model that takes MediaPipe as input. ControlNet offers pose2img methods such as OpenPose, DensePose, line draft, and depth map to control posture, so you may need to build a text2pose + pose2vid pipeline based on MediaPipe yourself. If you can complete the entire pipeline, that would be a valuable contribution.
However, I suggest that you complete a sign language recognition model based on MediaPipe (sign2text), as this is more important. The pipeline above is more like a branch at the end of a branch: others may see and reference it, some may continue your work, and some may not.
If you have already processed the data and improved accuracy, you can try the recognition model above. For sign language recognition, your model will not run into interface issues (such as feeding a subsequent pose2img stage). Sign language recognition does not involve much beyond that; of course, the higher the accuracy, the better. It would also be easy to develop a mobile sign language recognition app so that the relatives of deaf people can understand them at any time.
So yes, you could build the first ASL long-video recognition model based on MediaPipe, and most importantly, you should make it more convenient to use, since the previous recognition model was too difficult to use.
@SignDiff
Thank you for the guidance. If possible, could you please share your pretrained recognition model with me? I think it would help me a lot in further investigating and improving the setup.
@SignDiff
Thank you for your guidance. Could you please tell me whether you have shared the src_vocab.txt along with your preprocessed data? I am experimenting with your preprocessed data and would like to compare the results with my MediaPipe-preprocessed data. If possible, could you share the file?
I uploaded a script for building a glossary; you just put it in the same folder (as shown in the image) and run it, and you get a glossary.
I also uploaded three vocabularies for immediate use.
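For reference, a minimal sketch of what such a glossary/vocabulary script might look like (whitespace tokenization and the special-token header are assumptions; the actual uploaded script may differ):

```python
from collections import Counter

def build_vocab(text_file, out_file="src_vocab.txt", min_freq=1):
    """Count whitespace tokens in a one-sentence-per-line corpus and write
    one token per line, most frequent first, after the special tokens."""
    counts = Counter()
    with open(text_file, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip().split())

    specials = ["<unk>", "<pad>", "<s>", "</s>"]  # assumed convention
    with open(out_file, "w", encoding="utf-8") as out:
        for tok in specials:
            out.write(tok + "\n")
        for tok, c in counts.most_common():
            if c >= min_freq and tok not in specials:
                out.write(tok + "\n")
```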
@SignDiff
Thank you so much. Really appreciate it. I will inform you of the results I notice.
@khannabeela OK. By the way, you seem to be familiar with extracting keypoints with MediaPipe. I don't know whether you have used OpenPose to extract keypoints at scale. I would like you to help me extract a dataset using OpenPose, in the same JSON format as the original How2Sign paper: the one with 8 upper-body keypoints and 21 keypoints per hand, about 20 GB in total. I could handle it myself, but I'm a little busy, and if you happen to have the tools at hand, that would be great.
@SignDiff
Yes, I use MediaPipe to extract 3D keypoints directly from the How2Sign dataset. Sure, I can help you with the keypoint extraction. Could you give me more details about the dataset? Also, do you want me to use OpenPose only?
@khannabeela
I spent several hours uploading the dataset, but the response was a bit slow. This is the download address for the dataset: https://drive.google.com/file/d/18AOltFbiJev9-clJv_9BSGnM1bz5JKQN/view?usp=sharing
It is a zip file containing 10k videos, and I would like you to use only OpenPose to extract the keypoints into JSON files. The JSON keypoints of each video are stored in a folder with the same name as the video. By default, keypoints are extracted at 24 frames per second (i.e. 24 times per second; for example, a 5-second video xx should yield 120 JSON files in the xx folder).
For example, the folder named _0fO5ETSwyg_0-5-rgb_front contains all of the JSON output for the video _0fO5ETSwyg_0-5-rgb_front, e.g. _0fO5ETSwyg_0-5-rgb_front_000000000000_keypoints.json. I gave a reference JSON format in the comments.
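As a sanity check after extraction, a small sketch for comparing the number of JSON files per folder against the expected 24-per-second rate (it assumes `ffprobe` is available, and the file naming follows the example above):

```python
import os
import subprocess

FPS = 24  # assumed extraction rate: 24 keypoint sets per second

def expected_frames(video_path):
    """Return the expected number of per-frame JSON files for a video."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", video_path],
        capture_output=True, text=True, check=True)
    return round(float(out.stdout.strip()) * FPS)

def check_folder(video_path, json_dir):
    """Compare the *_keypoints.json count with the expected frame count."""
    n_json = sum(f.endswith("_keypoints.json") for f in os.listdir(json_dir))
    print(f"{os.path.basename(json_dir)}: {n_json} JSON files, "
          f"~{expected_frames(video_path)} expected")
```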
If you are willing to take this on, I recommend finishing it within seven days, as I will have more free time after that. When you are done, please email me the dataset along with your name and information at sen.fang@live.vu.edu.au, and I will add your name as an author on the relevant paper.
@SignDiff
Thank you for the data. I think a seven-day deadline is a little tight; however, if I am able to complete it, I will let you know.
@SignDiff
Hope you are doing well. I am sorry, I might not be able to finish it completely on time; however, I will keep you updated with the results I get.
@SignDiff Please check your email.
Hello, thank you for the great work. Could you please share the complete processed data for the How2Sign dataset? You mentioned that you later processed four times the amount of data described in the paper.