Open DYZhang09 opened 1 year ago
In the paper, the halting score distribution is defined as below:
However, the corresponding code seems wrong. https://github.com/NVlabs/A-ViT/blob/120c9cb90acf86828f1c61dd42c08722aa7173c7/timm/models/act_vision_transformer.py#L464-L465
The shape of h_lst[1] is [B, N], so h_lst[1][1:] indexes the batch dimension: the code drops the first sample of the batch and averages everything else into a single scalar, instead of dropping the class token of each sample. I think the right code is: self.halting_score_layer.append(torch.mean(h_lst[1][:, 1:], dim=-1))
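To make the difference concrete, here is a minimal sketch comparing the two variants. It assumes only what the issue states: h_lst[1] holds per-token halting scores of shape [B, N] (the tensor h below is a stand-in, not the actual model output).

```python
import torch

# Stand-in for h_lst[1]: per-token halting scores, shape [B, N].
B, N = 4, 197
h = torch.rand(B, N)

# Variant in the repo: h[1:] indexes the *batch* dimension, dropping
# the first sample, then averages everything into one scalar.
repo_mean = torch.mean(h[1:])
print(repo_mean.shape)   # torch.Size([]) -- scalar over B-1 samples

# Proposed fix: h[:, 1:] drops the first *token* (class token) of every
# sample, then averages over the token dimension -> one score per sample.
fixed_mean = torch.mean(h[:, 1:], dim=-1)
print(fixed_mean.shape)  # torch.Size([4]) -- one value per batch element
```

The shapes alone show the two lines are not equivalent: the first collapses the batch into a scalar, while the second keeps a per-sample score.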
Can you tell me which one is correct? Thanks!
I have the same question