Maximum load of the first transformer iteration

PeiyanFlying / SPViT

Maximum load of the first transformer iteration #5

Open deanAirre opened 3 months ago

deanAirre commented 3 months ago

Good Evening,

I am very interested in research on making Transformers more accessible to the public, especially to communities without access to good GPUs.

I have already resolved my previous question, but I would like to ask another one. Is the first iteration of transformer training supposed to be heavy in load because it runs on unpruned tokens? I understand that the first iteration is necessary for determining which tokens to focus on in the pruning stage, but how large is the maximum load of the first iteration compared to the final pruned model that still has good accuracy?

Thanks in advance, Regards, Sean.

ZLKong commented 1 month ago

Hi Sean:

The "first iteration of transformer" you mentioned, do you mean the first transformer block or the first iteration during training?

deanAirre commented 1 month ago

Dear PeiyanFlying,

Yes, the first iteration during training, before pruning happens. Also, in case it is relevant, have you heard of a method to 'infuse' a trained model into the first-iteration blocks of the transformer so that it doesn't have to be trained from scratch?

Thanks in advance, best regards, Sean.

ZLKong commented 1 month ago

Hi Sean:

The first iteration of transformer training should be heavy in load because pruning has not started yet, but the load should be similar to that of the original ViT.
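To put a rough number on "similar load to the original ViT", here is a minimal sketch that counts multiply-accumulate operations (MACs) per transformer block as a function of the token count. It uses the common approximation of 12·N·D² + 2·N²·D MACs per block (MLP ratio 4) and ignores the patch embedding, layer norms, and the classification head. The DeiT-S-like dimensions, the pruning positions, and the 0.7 keep ratio are illustrative assumptions, not SPViT's actual schedule, and the helper names are hypothetical.

```python
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate MACs of one transformer block for a given token count."""
    proj = 4 * num_tokens * dim * dim              # Q, K, V and output projections
    attn = 2 * num_tokens * num_tokens * dim       # QK^T and attention-weighted sum of V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim   # two linear layers of the MLP
    return proj + attn + mlp


def tokens_per_block(depth, num_patches, prune_at, keep_ratio):
    """Token count seen by each block; the class token is never pruned."""
    counts = []
    patches = num_patches
    for i in range(depth):
        if i in prune_at:
            patches = int(patches * keep_ratio)    # hypothetical fixed keep ratio
        counts.append(patches + 1)                 # +1 for the class token
    return counts


def model_macs(counts, dim):
    return sum(block_macs(n, dim) for n in counts)


# Assumed DeiT-S-like setup: 12 blocks, width 384, 14 x 14 = 196 patches.
DEPTH, DIM, PATCHES = 12, 384, 196

full = model_macs(tokens_per_block(DEPTH, PATCHES, (), 1.0), DIM)
# Assumed schedule: prune before blocks 3, 6 and 9 (0-indexed), keeping 70% each time.
pruned = model_macs(tokens_per_block(DEPTH, PATCHES, (3, 6, 9), 0.7), DIM)

print(f"unpruned: {full / 1e9:.2f} GMACs")
print(f"pruned:   {pruned / 1e9:.2f} GMACs ({pruned / full:.0%} of unpruned)")
```

Under these assumptions the unpruned pass is the upper bound on the per-iteration load, and the saving from pruning depends entirely on where and how aggressively tokens are dropped.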

Regarding 'infuse', I am not sure about this. I assume it is similar to distillation or the lottery ticket method, where you get good initial weights for the layers and then do fine-tuning or training?
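For reference, the distillation idea mentioned above is usually expressed as a soft-target loss between a teacher's and a student's logits. The following is a minimal pure-Python sketch of the standard temperature-scaled KL divergence, not SPViT's training recipe; the function names, the temperature value, and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


# Made-up logits for a single 3-class example.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(distillation_loss(student, teacher))
```

In practice this term is typically added to the ordinary cross-entropy loss on the ground-truth labels, with a weighting factor chosen by the user.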

deanAirre commented 1 month ago

Dear PeiyanFlying,

Yes, the first iteration should be as heavy as the original ViT because no pruning has been done, so I was looking for a way to 'infuse' a pretrained model so that it doesn't have to be as heavy as the original transformer. Since it is confirmed that it will be as heavy, I will look for a way, maybe distillation or the lottery ticket method, to make SPViT even lighter.

But then I wonder how your 'adaptive pruning' method will 'see where it is suitable to stop' if it doesn't hold the embedding table from the first iteration of ViT training. Do you think it will still work if I 'distill' a model into the first SPViT training layer so that it goes straight to pruning?

Thanks in advance, the discussion has been very helpful, Sean