Hi, thank you for your interest in our work!
The sparsity-based temporal augmentation is designed based on the Nyquist–Shannon sampling theorem. Suppose the HR value we have is 120 bpm, which means the sampling rate has to be at least 120/60 × 2 = 4 Hz, i.e., 4 fps. Theoretically, any video above 4 fps should allow recovering the same HR, provided the underlying signal can be perfectly extracted from it. However, this is not realistic in real-world applications, since there are too many noise sources to account for. Therefore, our aim is to train a model that learns to ignore these noise sources and find the underlying invariant signals.
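As a small illustration of the calculation (the helper name `min_fps_for_hr` is just for this example, not from our code):

```python
def min_fps_for_hr(hr_bpm: float) -> float:
    """Nyquist lower bound on the frame rate needed to represent a heart rate.

    A heart rate of hr_bpm beats/min has a fundamental frequency of
    hr_bpm / 60 Hz, so the sampling theorem requires at least twice that rate.
    """
    return hr_bpm / 60.0 * 2.0

print(min_fps_for_hr(120))  # 4.0 fps, matching the example above
```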
Think about the data augmentation settings in MoCo v2: they use random resized crops with a scale range of [0.2, 1.0]. How would you expect two crops taken from opposite corners to look similar?
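For reference, the MoCo v2 crop setting corresponds roughly to the following torchvision transform (a sketch of the crop alone, not the full MoCo v2 augmentation pipeline):

```python
from torchvision import transforms

# MoCo v2 samples crops covering as little as 20% of the image area,
# so two views of the same image can look very different.
crop = transforms.RandomResizedCrop(224, scale=(0.2, 1.0))
```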
Cheers
Hi, thank you for your reply! I am afraid I still can't follow. The algorithm does work, but I suspect it works because of something unexplained rather than because of what is described in the paper. Here is an example.

Note that the model does not see the fps or stride of the input video clip; it only sees the 30 input frames. Imagine we use stride 1 and stride 5 to generate two clips from the exact same video. The stride-5 clip squeezes a 5-second BVP (or ECG) waveform into 30 frames, while the stride-1 clip squeezes in only a 1-second BVP waveform, so the 5-second BVP will look much denser than the 1-second one.

More specifically, say I have a 40-bpm video and a 200-bpm video. The clip generated from the 40-bpm video with stride 5 will have exactly the same apparent HR as the clip generated from the 200-bpm video with stride 1, given that the model never sees the fps or stride. How can the model recover the same (original) HR from them?

In fact, changing the stride is already used as a data augmentation method in the rPPG field precisely because it changes the ground-truth HR of a video; you can see it in "Robust Remote Heart Rate Estimation from Face Utilizing Spatial-temporal Attention". Thank you for your patience; I would really appreciate a reply, many thanks.
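P.S. To make my point concrete, here is a toy sketch of what I mean (my own example, assuming a 30 fps camera and a purely sinusoidal BVP):

```python
import numpy as np

fps = 30.0                     # assumed camera frame rate
t = np.arange(0, 10, 1 / fps)  # 10 s of timestamps

# Toy BVP signals modelled as plain sinusoids (a simplification).
bvp_40 = np.sin(2 * np.pi * (40 / 60) * t)    # 40 bpm recording
bvp_200 = np.sin(2 * np.pi * (200 / 60) * t)  # 200 bpm recording

clip_a = bvp_40[::5][:30]   # stride 5 from the 40 bpm recording
clip_b = bvp_200[::1][:30]  # stride 1 from the 200 bpm recording

# The two 30-sample clips are numerically identical, so a model that never
# sees the stride/fps cannot tell them apart from the frames alone.
print(np.allclose(clip_a, clip_b))  # True
```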
I see what you are confused about, and I agree with your point that the apparent HR changes when the fps changes. However, the paper you refer to uses the HR ground truth to constrain the model and assumes all input videos span the same time length.
That is not our case, and the underlying theory of our work remains the same. The objective of the sparsity-based temporal augmentation is to recover an invariant HR for the original videos, not for the processed clips. It is true that the apparent HR of a processed clip changes, but only under the assumption that the signals behind the clips span the same time length, and we have no labels forcing that assumption. Instead, the contrastive loss forces the model to recover the HR of the original video, so the model has to learn the notion of stride/fps on its own. This knowledge is then reinforced by the later pseudo-label classification task.
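Roughly, the contrastive part works like a standard InfoNCE objective where clips of different strides from the same video form the positive pair. The sketch below is only an illustration of that idea, not our exact implementation:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE loss over clip embeddings (illustrative sketch).

    z1, z2: (N, D) embeddings of two augmented clips per video. Row i of z1
    and row i of z2 come from the same source video (different strides) and
    are treated as a positive pair; all other rows act as negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (N, N) cosine-similarity matrix
    labels = torch.arange(z1.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```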
Cheers
I get it now, thank you for your patience. I realize now that it's the rPPG version of MoCo, though it's not that intuitive; I really wonder how the model can learn the knowledge of stride/fps. Again, great work, many thanks! 😊
Hi👋, great job, yet I still have a question about the Sparsity-Based Temporal Augmentation. In Section 4 (Experiment Setup) the paper says "each augmented clip was constrained to have length of 30-frame. Therefore, the longest clip (i.e., stride of 5) contained 5-second information, while the shortest clip (i.e., stride of 1) had 1-second information".

However, as far as I know, given two fixed-length video clips (30 frames here) with different fps, they ought to represent different heart rates (the larger the fps, the lower the apparent heart rate; the smaller the fps, the higher the apparent heart rate). The Sparsity-Based Temporal Augmentation used in the paper will therefore generate two 30-frame clips with different HRs from the exact same source video, and not only the HRs but other physiological signals will differ as well, I suppose.

Since the two fixed-length clips carry different physiological signals, how can they be trained with a contrastive loss? Intuitively, the contrastive loss used in the paper should help the model learn the invariance of physiological signals across two views of the same video and the difference in physiological signals across different videos. It's a bit confusing; I would really appreciate it if you could answer these questions, many thanks.
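P.S. Just to confirm my reading of the quoted setup, this is how I picture the clip generation (my own paraphrase in code, not the authors' implementation, assuming simple frame-index selection and a sufficiently long video):

```python
import numpy as np

def sparse_temporal_clips(video, clip_len=30, strides=(1, 2, 3, 4, 5)):
    """My reading of the quoted augmentation: fixed-length clips taken with
    different frame strides, so at 30 fps a stride-s clip spans s seconds.

    video: array of shape (T, H, W, C), assumed long enough for every stride.
    """
    clips = []
    for s in strides:
        start = np.random.randint(0, len(video) - clip_len * s + 1)
        idx = start + s * np.arange(clip_len)  # every s-th frame from start
        clips.append(video[idx])
    return clips
```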