FangyunWei / SLRT


Request for Feature Extractions from Individual Stream Encoders in the Two-Stream Network Model #58

Closed Triver-ac closed 2 months ago

Triver-ac commented 3 months ago

Hi FangyunWei Team,

I am exploring the implementation of the Two-Stream Network from your SLRT project for sign language recognition and translation. My interest specifically lies in acquiring features extracted by each stream encoder independently, prior to any joint training.

Due to my limited resources and technical capabilities, I have struggled to replicate the Keypoint pre-training results described in your paper. After loading your provided checkpoint, I unfortunately observed suboptimal inference results on our 3090 GPU, which led me to seek your assistance.

Could you possibly provide the features extracted using only the V-Encoder and K-Encoder individually, as detailed in the first two rows of Table 3 in your paper "Two-Stream Network for Sign Language Recognition and Translation"?

Thank you very much for considering my request.

Best regards, RuiquanZhang

Triver-ac commented 3 months ago

If you could provide the features for both the Phoenix2014T and CSL-Daily datasets, I would be immensely grateful.

2000ZRL commented 3 months ago

Sorry for the late reply. What is the memory capacity of your 3090 GPU? For feature extraction, the model runs in evaluation mode and you can use a batch size of 1. Even the largest two-stream model can be run in evaluation mode on a 16 GB GPU, so a 3090 should handle it.
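
For reference, here is a minimal sketch of that setup: evaluation mode, `torch.no_grad()`, and a batch size of 1. The encoder, checkpoint path, and feature dimensions below are placeholders for illustration, not the actual SLRT model or API:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in encoder; in practice this would be the SLRT K-Encoder / V-Encoder
# built from the repo's config and loaded from the released checkpoint.
encoder = nn.Sequential(nn.Linear(274, 512), nn.ReLU(), nn.Linear(512, 512))
# state = torch.load("keypoint_ckpt.pth", map_location="cpu")   # hypothetical path
# encoder.load_state_dict(state, strict=False)

encoder.eval()                       # evaluation mode: disables dropout, freezes BatchNorm stats
encoder.to(device)

# Dummy data standing in for keypoint sequences (arbitrary dimensions for illustration);
# batch_size=1 keeps memory requirements minimal.
dataset = TensorDataset(torch.randn(10, 64, 274))   # 10 clips, 64 frames, 274-dim keypoints
loader = DataLoader(dataset, batch_size=1)

features = []
with torch.no_grad():                # no autograd graph is kept, so memory use stays low
    for (clip,) in loader:
        feats = encoder(clip.to(device))
        features.append(feats.cpu())
```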

Triver-ac commented 2 months ago

> Sorry for the late reply. What is the memory capacity of your 3090 GPU? For feature extraction, the model runs in evaluation mode and you can use a batch size of 1. Even the largest two-stream model can be run in evaluation mode on a 16 GB GPU, so a 3090 should handle it.

Hello,

Firstly, I would like to express my sincere appreciation for your team's outstanding contributions to the field of sign language recognition. I am currently attempting to replicate your single-stream keypoint-based sign language recognition model (K-Encoder, with WERs of 27.14% on the Dev set and 27.19% on the Test set) using an RTX 3090, but I have encountered some technical challenges.

I have reviewed the files shared via your link here, but I was unable to find the single-stream keypoint ckpt for the SLR stage. The available ckpt files, csl-daily_keypoint, phoenix-2014_keypoint, and phoenix-2014t_keypoint, appear to be generated during the translation stage and lack the recognition.backbone parameters.

To replicate and understand your results more accurately, may I ask whether it is possible to obtain the complete single-stream keypoint SLR checkpoint, or the SLR features extracted by that model? This would greatly assist my research. Thank you once again for your attentive response and support; I look forward to your reply and wish you continued success in your research!
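
In case it is useful to others, this is a small sketch of how I checked which sub-modules a released checkpoint contains. The file name and the `model_state` / `state_dict` wrapper keys are assumptions; adjust them to however the SLRT checkpoints are actually saved:

```python
import torch

# Hypothetical path; replace with the downloaded csl-daily_keypoint / phoenix-2014t_keypoint file.
ckpt = torch.load("phoenix-2014t_keypoint.ckpt", map_location="cpu")

# Checkpoints are often dicts wrapping the real state_dict under a key such as
# "model_state" or "state_dict"; otherwise the object itself is the state_dict.
if isinstance(ckpt, dict):
    state_dict = ckpt.get("model_state", ckpt.get("state_dict", ckpt))
else:
    state_dict = ckpt

# Group parameter names by their top-level prefix to see which sub-modules were saved.
prefixes = sorted({name.split(".")[0] for name in state_dict.keys()})
print(prefixes)

# If no "recognition.backbone.*" parameters are present, the checkpoint only covers
# the translation stage and cannot initialise the SLR backbone.
has_slr = any(name.startswith("recognition.backbone") for name in state_dict)
print("recognition.backbone present:", has_slr)
```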

Triver-ac commented 2 months ago

> Sorry for the late reply. What is the memory capacity of your 3090 GPU? For feature extraction, the model runs in evaluation mode and you can use a batch size of 1. Even the largest two-stream model can be run in evaluation mode on a 16 GB GPU, so a 3090 should handle it.

Thank you very much for your response; I am now able to replicate the results from the paper. I found that when reproducing TwoStream SLT and MMTLB, the batch size must be no less than 8. If GPU memory is insufficient, multi-GPU data parallelism combined with gradient accumulation effectively simulates this setting, as sketched below.
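
A minimal sketch of the gradient-accumulation part, simulating an effective batch size of 8 from per-step batches of 2. The model, data, loss, and optimiser here are generic placeholders, not the actual TwoStream/MMTLB training loop:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Generic placeholders for the actual model, data, and loss.
model = nn.Linear(512, 100).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.randn(64, 512), torch.randint(0, 100, (64,))),
    batch_size=2,                      # what actually fits on the GPU
)

accum_steps = 4                        # 2 (per-step batch) * 4 steps = effective batch size of 8
optimizer.zero_grad()

for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs.to(device)), targets.to(device))
    (loss / accum_steps).backward()    # scale so the accumulated gradient averages over 8 samples
    if (step + 1) % accum_steps == 0:
        optimizer.step()               # update once every accum_steps micro-batches
        optimizer.zero_grad()
```

When this is combined with DistributedDataParallel across several GPUs, the effective batch size is further multiplied by the number of GPUs.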