Added support for leveraging the (x,y) coordinates of each region in a slide when pooling the sequence of region-level features into a single slide-level representation. Two variants are available:
deterministic (layer dubbed positional encoding)
learnable (layer dubbed positional embedding)
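To make the difference between the two variants concrete, here is a minimal standard-library sketch. It is not the code from this change: the sinusoidal formula, the per-axis lookup tables, the names `positional_encoding_2d` and `PositionalEmbedding2D`, and the use of integer grid indices for (x,y) are all assumptions made for illustration.

```python
import math
import random

def sincos_1d(pos, dim):
    """Sinusoidal encoding of one scalar position into `dim` values (dim must be even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out

def positional_encoding_2d(x, y, dim):
    """Deterministic variant (assumed sinusoidal): the first half of the
    vector encodes x, the second half encodes y. `dim` must be divisible by 4."""
    half = dim // 2
    return sincos_1d(x, half) + sincos_1d(y, half)

class PositionalEmbedding2D:
    """Learnable variant (hypothetical layout): one lookup table per axis.
    The entries are random here; in a real model they would be trained."""

    def __init__(self, max_x, max_y, dim, seed=0):
        rng = random.Random(seed)
        half = dim // 2
        self.x_table = [[rng.gauss(0.0, 0.02) for _ in range(half)] for _ in range(max_x)]
        self.y_table = [[rng.gauss(0.0, 0.02) for _ in range(half)] for _ in range(max_y)]

    def __call__(self, x, y):
        # Concatenate the x-axis and y-axis vectors into one position vector.
        return self.x_table[x] + self.y_table[y]
```

The deterministic variant has no parameters and returns the same vector for a given (x,y) on every call, while the learnable variant stores parameters that an optimizer would update during training.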
The position encoding strategy depends on the aggregation method:
when concatenating region features from multiple slides into a single sequence, add an embedding for the (x,y) position of the region in the slide, plus an extra embedding that encodes which slide the region belongs to (e.g. two regions with coordinates (0,0) coming from 2 different slides get 2 different positional embeddings)
when processing each slide sequentially, the extra slide embedding is no longer needed, since all regions in a sequence come from the same slide
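The two strategies above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this change: the function names, the `slide_table` lookup, the toy position function, and the choice to sum the embeddings with the region feature (rather than concatenate them) are all hypothetical.

```python
def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]

def encode_single_slide(regions, position_fn):
    """Per-slide processing: only the (x, y) position vector is added."""
    return [add_vectors(feature, position_fn(x, y)) for feature, (x, y) in regions]

def concat_with_slide_embedding(slides, position_fn, slide_table):
    """Multi-slide concatenation: add the (x, y) position vector plus a
    per-slide vector, so equal coordinates in different slides differ.

    slides: list of slides, each a list of (feature, (x, y)) pairs.
    slide_table: one vector per slide index (hypothetical learnable table).
    """
    sequence = []
    for slide_idx, regions in enumerate(slides):
        for token in encode_single_slide(regions, position_fn):
            sequence.append(add_vectors(token, slide_table[slide_idx]))
    return sequence

def toy_position(x, y):
    """Stand-in for a real positional layer, used only for this example."""
    return [float(x), float(y)]

# Two slides, each with one region at (0, 0) and identical features.
slides = [
    [([0.0, 0.0], (0, 0))],
    [([0.0, 0.0], (0, 0))],
]
slide_table = [[0.1, 0.1], [0.2, 0.2]]
tokens = concat_with_slide_embedding(slides, toy_position, slide_table)
```

In this example both regions share the coordinates (0,0), yet the two resulting tokens differ because each receives the vector of its own slide. With `encode_single_slide` alone they would be identical, which is harmless when each slide is pooled on its own.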