Doubts about masking strategy

Hi! Thanks for the open-source code. I have a question about the masking strategy. The paper says: "Uniformly masking stage-1 input tokens from the H/4 × W/4 feature maps would cause all tokens of stage-3 to have partially visible information and requires keeping all stage-3 tokens." Why would visible information still reach stage 3 if the image was already masked at the input? Thanks very much!
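To make the question concrete, here is a minimal sketch of how I read that sentence. It is not taken from the repository: the function name `stage3_visible_fraction`, the 56 × 56 stage-1 grid (H/4 for a 224-pixel image), the 4 × 4 grouping of stage-1 tokens into one stage-3 token, and the 75% mask ratio are all my own assumptions for illustration.

```python
import numpy as np

def stage3_visible_fraction(grid=56, factor=4, unit=1, mask_ratio=0.75, seed=0):
    """Fraction of stage-3 tokens that contain at least one visible stage-1 token.

    grid:   side length of the stage-1 token grid (assumed H/4 = 56)
    factor: stage-1 tokens per stage-3 token along each side (assumed 4)
    unit:   side length of the masking unit, in stage-1 tokens
            (1 = mask stage-1 tokens independently, 4 = mask whole stage-3 tokens)
    """
    rng = np.random.default_rng(seed)
    g = grid // unit
    n = g * g
    n_visible = int(round(n * (1 - mask_ratio)))

    # Randomly choose which masking units stay visible.
    coarse = np.zeros(n, dtype=bool)
    coarse[rng.choice(n, size=n_visible, replace=False)] = True
    coarse = coarse.reshape(g, g)

    # Expand every masking unit to the stage-1 tokens it covers.
    visible = np.repeat(np.repeat(coarse, unit, axis=0), unit, axis=1)

    # Group stage-1 tokens into the factor x factor block behind each stage-3 token.
    g3 = grid // factor
    blocks = visible.reshape(g3, factor, g3, factor)
    return blocks.any(axis=(1, 3)).mean()

print(stage3_visible_fraction(unit=1))  # masking at stage-1 granularity
print(stage3_visible_fraction(unit=4))  # masking aligned to stage-3 tokens
```

If this reading is right, with `unit=1` nearly every stage-3 token still covers at least one visible stage-1 token (a 4 × 4 block is fully masked with probability of roughly 0.75^16, about 1%), so almost none of them could be dropped. With `unit=4` exactly 25% of the stage-3 tokens contain visible information. Is that the effect the paper is describing?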