Hi, the training speed is indeed slow. For example, it took around 4 days to train an L2 model on 8 A100 GPUs. However, we believe this slowness is largely due to our use of the default selective scan function provided by the mamba-ssm package. (We kept the default implementation so that others could reproduce our results with minimal effort :) )
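For reference, a minimal sketch of what feeding flattened image features through the stock mamba-ssm block looks like (the sizes below are illustrative placeholders, not the actual L2 configuration; the Mamba block internally calls the default selective scan):

```python
import torch
from mamba_ssm import Mamba  # stock block; uses the default selective scan internally

# Illustrative sizes only, not the real L2 configuration.
B, C, H, W = 2, 192, 14, 14
x = torch.randn(B, C, H, W, device="cuda")

# Flatten the 2D feature map into a 1D sequence of length H * W: (B, L, C).
seq = x.flatten(2).transpose(1, 2)

block = Mamba(d_model=C, d_state=16, d_conv=4, expand=2).to("cuda")
out = block(seq)  # (B, H*W, C)
print(out.shape)
```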
I have also tried a "zigzag" scan, but it is much slower than simply flattening the image into a 1D sequence. Could you report the training speed of your method? A rough sketch of the two scan orders I compared is below.
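To be concrete, here is a small sketch of the two orderings I mean. Treating "zigzag" as a boustrophedon scan (reversing every other row before flattening) is my own interpretation, so the details may differ from other definitions:

```python
import torch

H, W, C = 4, 4, 3
x = torch.arange(H * W * C, dtype=torch.float32).reshape(1, C, H, W)

# Plain flatten: row-major raster order, (B, H*W, C).
raster = x.flatten(2).transpose(1, 2)

# "Zigzag" (boustrophedon) order: reverse every other row so adjacent
# tokens stay spatially adjacent across row boundaries.
rows = x.permute(0, 2, 3, 1).clone()   # (B, H, W, C)
rows[:, 1::2] = rows[:, 1::2].flip(2)  # flip odd rows along the width axis
zigzag = rows.reshape(1, H * W, C)
```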
Best regards!