Thanks for your great work!
I'm interested in the code for the 'Vision Perceiver'.
The paper states: "We extract hidden states at layers {i = L/3, j = 2L/3, k = L − 1} for summarizing through vision perceiver, where L denotes the number of layers in the vision encoder."
However, reading the code, I only found that 'AttnPooler' takes the final output of the 'Vision Encoder' as input and splits it evenly into 3 parts.
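To make the difference concrete, here is a minimal toy sketch of the two behaviours as I understand them. It is not the repository's code: the function names, the integer rounding of L/3, the 0-based layer indexing, and the axis used for the split are all my own assumptions for illustration.

```python
def expected_layer_indices(num_layers):
    """Layer indices named in the paper: i = L/3, j = 2L/3, k = L - 1."""
    # Assumption: integer division; the paper does not say how L/3 is rounded.
    return [num_layers // 3, 2 * num_layers // 3, num_layers - 1]


def select_hidden_states(all_hidden_states, num_layers):
    """What the paper describes: pick three per-layer hidden states."""
    # Assumption: all_hidden_states[l] is the output of layer l (0-based).
    return [all_hidden_states[idx] for idx in expected_layer_indices(num_layers)]


def split_final_output(final_output, num_parts=3):
    """What I seem to see in the code: the final output cut into equal parts."""
    # Assumption: the split runs along the first axis of a flat list.
    chunk = len(final_output) // num_parts
    return [final_output[p * chunk:(p + 1) * chunk] for p in range(num_parts)]


# Toy data: 24 "layers", each hidden state is a list of 6 token labels.
L = 24
hidden_states = [[f"layer{l}_tok{t}" for t in range(6)] for l in range(L)]

# Paper: three hidden states taken from three different depths.
from_paper = select_hidden_states(hidden_states, L)
print([h[0] for h in from_paper])

# Code as I read it: three slices that all come from the last layer.
from_code = split_final_output(hidden_states[-1])
print([part[0] for part in from_code])
```

In the first case the three inputs come from different depths of the encoder; in the second they are all pieces of the last layer's output, which is why the two do not look equivalent to me.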
Am I missing some details?