Questions on paper #1

Hello, thanks for your great work! I have two questions about it and would appreciate your answers.

Q1: Why Swin-T + Conv?
Why not simply use Swin-T Patch Merging to downsample the features? Alternatively, since you already use a convolution operation to downsample and fuse hierarchical information, is local self-attention still necessary?

Q2: Positional Embedding
Your work uses SPE and TPE as priors to enhance model performance. Is there an ablation study against other PE methods, such as learned PE or sine PE? I ask because I have run some related experiments, and the results show that the benefit of prior PE is small; under some conditions it is even worse than learned PE.
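To make the Q2 comparison concrete, here is a minimal sketch of the two baselines I mean by "Sine" and "learning PE". It assumes the standard Transformer sinusoidal formulation and a randomly initialized table for the learned variant; the function names `sinusoidal_pe` and `learned_pe_init` are hypothetical and are not taken from this repository.

```python
import math
import random


def sinusoidal_pe(num_positions, dim):
    """Fixed sine/cosine table, assuming the standard Transformer form:
    PE[pos][2k] = sin(pos / 10000^(2k/dim)), PE[pos][2k+1] = cos(same angle)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):  # i plays the role of 2k
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))  # even channel
            row.append(math.cos(angle))  # odd channel
        table.append(row)
    return table


def learned_pe_init(num_positions, dim, seed=0, std=0.02):
    """Initial values for a learned PE table (assumed Gaussian init).
    In a real model these entries would be trainable parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)]
            for _ in range(num_positions)]


if __name__ == "__main__":
    fixed = sinusoidal_pe(num_positions=4, dim=8)
    learned = learned_pe_init(num_positions=4, dim=8)
    print(len(fixed), len(fixed[0]), len(learned), len(learned[0]))
```

In an ablation, either table would simply replace the prior-based embedding that is added to the tokens, with everything else in the model held fixed.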

Best regards!

MediaBrain-SJTU / TBP-Former
