Closed gsndr closed 3 months ago
AggregateBackbone concatenates multi-scale feature maps extracted from different groups of images; within each group, the feature maps undergo temporal max pooling to compute one multi-scale feature map for the group.
[[0,1,2,3,4,5,6,7]]
has only one group so it computes temporal max pooling across all input images.
[[0,1,2,3],[4,5,6,7]]
has two groups, so it pools the features of images 0-3 and 4-7 separately and then concatenates the two results. This is useful if there are four images before an "event" and four images after, and the model should compare them.
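To make the grouping concrete, here is a toy sketch in plain Python. It is not the AggregateBackbone code: the `aggregate` function and the flat per-image feature vectors are assumptions for illustration only, standing in for the real multi-scale feature maps. It only shows how the max is taken within each group and how the group results are concatenated.

```python
def aggregate(features, groups):
    """Toy stand-in for group-wise temporal max pooling.

    features: one feature vector (list of floats) per image, all the same length.
    groups:   list of groups, each a list of image indices.
    Returns the per-group element-wise maxima, concatenated in group order.
    """
    pooled = []
    for group in groups:
        group_feats = [features[i] for i in group]
        # Element-wise max across the images in this group ("temporal max pooling").
        group_max = [max(values) for values in zip(*group_feats)]
        # Concatenate this group's pooled vector onto the output.
        pooled.extend(group_max)
    return pooled


# Eight fake images, each with a 2-channel feature vector [i, -i].
features = [[float(i), float(-i)] for i in range(8)]

one_group = aggregate(features, [[0, 1, 2, 3, 4, 5, 6, 7]])
two_groups = aggregate(features, [[0, 1, 2, 3], [4, 5, 6, 7]])

print(one_group)   # [7.0, 0.0]
print(two_groups)  # [3.0, 0.0, 7.0, -4.0]
```

Note that with two groups the output is twice as long as with one group, since each group contributes its own pooled vector.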
The pretrained models are only designed to work in the setting where all the features go through temporal max pooling. So for different numbers of images, you would set the groups to [[0, 1, ..., N-1]] where N is the number of images. If you have a task where the time series is particularly important, then you can keep the Swin backbone and apply it to each image, but use a different architecture to process the extracted features over time.
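For the single-group setting, the groups value for N images can be built rather than typed out by hand. This is a minimal sketch; `make_single_group` is a hypothetical helper name, not something from the repository, and how the result is passed to the model depends on your own code.

```python
def make_single_group(num_images):
    """Return one group covering image indices 0 .. num_images - 1."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return [list(range(num_images))]


groups_for_8 = make_single_group(8)
groups_for_12 = make_single_group(12)

print(groups_for_8)  # [[0, 1, 2, 3, 4, 5, 6, 7]]
```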
Could you explain the meaning of this variable, and how I should change it to suit my time series data? self.groups = [[0, 1, 2, 3, 4, 5, 6, 7]].
In addition, is it possible to use the multi-image pretrained models with more than 4 images? If yes, could you indicate how to change the code to work with more images?