Open veizgyauzgyauz opened 4 years ago
Thanks for the attention! The code will be released within this month. About the training: (1) five previous frames are used as reference, the same as [7]; (2) yes, all reference frames are used to reconstruct the query at the same time, with one big affinity matrix computed and softmax applied afterwards.
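For anyone else reading along, here is a minimal PyTorch sketch of what (2) might look like: the function name `reconstruct` and the exact tensor shapes are my own illustration, not the released code.

```python
import torch
import torch.nn.functional as F

def reconstruct(query_feat, ref_feats, ref_labels, temperature=1.0):
    """Reconstruct the query frame from K reference frames at once.

    query_feat: (B, C, H, W)      features of the query frame
    ref_feats:  (B, K, C, H, W)   features of the K reference frames
    ref_labels: (B, K, D, H, W)   values to propagate (e.g. colors or masks)
    """
    B, K, C, H, W = ref_feats.shape
    q = query_feat.flatten(2).transpose(1, 2)                 # (B, HW, C)
    k = ref_feats.permute(0, 1, 3, 4, 2).reshape(B, K * H * W, C)
    v = ref_labels.permute(0, 1, 3, 4, 2).reshape(B, K * H * W, -1)

    # One big affinity matrix over ALL reference pixels, then a single
    # softmax, so every reference frame competes in the same normalization.
    affinity = torch.bmm(q, k.transpose(1, 2)) / temperature  # (B, HW, KHW)
    attn = F.softmax(affinity, dim=-1)
    out = torch.bmm(attn, v)                                  # (B, HW, D)
    return out.transpose(1, 2).reshape(B, -1, H, W)
```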
@veizgyauzgyauz
If you want to get a rough idea, please refer to this code first: https://github.com/zlai0/MAST
Thanks! @fangruizhu @WeidiXie
I am just curious about the application of momentum update in the VOS field. It seems that for each target frame, you choose K previous frames from the same video sequence as reference and treat them as the frames from the memory bank. In a forward pass, you update the key encoder with the momentum update first and then reconstruct the target frame. In the next iteration, the reference frames from the last iteration are not used again; instead, previous frames from the current video sequence serve as the memory frames. However, in MoCo the keys in the memory dictionary are independent (the images come from different scenes) and are reused for several iterations until they are dequeued, while in VOS the keys from the memory bank are temporally consistent and are dequeued after one iteration. Will this damage the smoothness of the key encoder?
I tried to implement the momentum memory mechanism in MAST. Everything is the same as MAST except that I apply the momentum update to the key encoder during finetuning. The pairwise training went smoothly and I obtained results similar to MAST's. But at the finetuning stage, the training loss kept fluctuating. Moreover, I got 0.57 J&F-mean, 0.56 J-mean, and 0.60 F-mean, which are lower than the scores before finetuning. I really want to figure this out.
@veizgyauzgyauz Hi, our momentum update is quite different from MoCo's: the frames in our memory bank change constantly and differ between iterations. Since our task does not aim at instance discrimination but at the matching ability between the key and query encoders, it may not be disturbed by the changing memory. Also, a relatively large momentum value helps maintain the smoothness. We use two encoders (key and query) during training: the parameters of the query encoder are always updated with BP, and the key encoder with the momentum update. Besides, in this part https://github.com/zlai0/MAST/blob/master/models/colorizer.py we take all pixels to compute similarity and apply softmax.
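A rough sketch of the two-encoder scheme as I understand it from this answer (a plain MoCo-style EMA; the actual training loop may well differ):

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """EMA update: theta_k <- m * theta_k + (1 - m) * theta_q.
    A large m keeps the key encoder slowly varying (smooth)."""
    for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

# Illustrative training step:
# the key encoder starts as a copy of the query encoder and is never
# updated by BP, e.g. key_encoder = copy.deepcopy(query_encoder) with
# requires_grad = False on all its parameters.
#
# momentum_update(query_encoder, key_encoder)      # update keys first
# with torch.no_grad():
#     ref_feats = key_encoder(reference_frames)    # keys: no gradients
# query_feat = query_encoder(query_frame)          # query: trained with BP
```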
Do you mean that you use global attention rather than restricted attention during training and inference? I think it may introduce high memory usage and, not surprisingly, `RuntimeError: CUDA out of memory` occurred at inference! Could you please tell me what GPU you use? Maybe I need to estimate whether my machine can manage it. Thanks! @fangruizhu
Hi, the size of the affinity matrix during training is B x (96x96) x (25x25), where 96 is the spatial size of the feature map and 25 is the window size. A 16 GB V100 is enough for training with 6 images per GPU. At inference, because full-resolution images are used (same as MAST), a 32 GB V100 is needed.
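For a back-of-the-envelope check of whether a given GPU can hold that training affinity matrix (fp32 assumed; activations, gradients, and the softmax output stored for backward all add on top of this):

```python
# Affinity matrix during training: B x (96*96) x (25*25), fp32.
B = 6                                    # images per GPU, as stated above
elems = B * (96 * 96) * (25 * 25)        # ~34.6M entries
print(f"{elems * 4 / 1024**3:.2f} GiB")  # ~0.13 GiB for the matrix alone
```

The matrix itself is small at training resolution; it is the full-resolution inference that blows up the memory, hence the 32 GB card.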
Hi,
Great job! May I know when the code will be released?
Thanks.
Hi, thank you for sharing such great work! I wonder when you will release the code. I have some questions about the implementation of the momentum memory. During training, (1) how many previous frames do you use as reference frames? (2) do you use the reference frames to reconstruct the query frame separately (i.e., one by one) or all at the same time (applying a softmax afterwards), like [7]? I would appreciate it if you could help me. I hope you can release the code soon. Thank you!