UMass-Foundation-Model / FlexAttention


[Question] Regarding batch size and fine-tuning methods #3

Closed: yinangit closed this issue 3 months ago

yinangit commented 4 months ago

Question

Great work!

  1. According to the paper, the batch size is set to 1152. How many GPUs were used during training?
  2. Is the training full fine-tuning or parameter-efficient fine-tuning?
senfu commented 4 months ago

Hi, thanks for your interest.

  1. I double-checked the training script and found that the batch size is not 1152. We use 192 V100 GPUs for training with a per-GPU batch size of 2, so the total batch size is 384 (see the quick check below).
  2. We do full fine-tuning.
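
A minimal sanity check of that arithmetic, assuming plain data parallelism with no gradient accumulation (the variable names below are illustrative, not taken from the training script):

```python
# Effective global batch size under plain data parallelism.
num_gpus = 192          # V100 GPUs used for training
per_gpu_batch_size = 2  # samples processed per GPU per step

global_batch_size = num_gpus * per_gpu_batch_size
print(global_batch_size)  # 384, not 1152
```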
yinangit commented 4 months ago


Thanks for your reply. I would like to ask two more questions:

  1. Is CLIP fully fine-tuned or frozen?
  2. What are the $N_{SA}$ and $N_{FA}$ in the paper?
senfu commented 4 months ago
  1. CLIP is frozen.
  2. $N_{SA}$ is set to 8 empirically, since preliminary experiments show that the first 8 layers do not produce good attention maps. All remaining layers use FlexAttention, so $N_{FA}$ covers the rest of the layers (see the sketch below).
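
For readers trying to picture the layer layout, here is a minimal PyTorch-style sketch of the split. The class names, total depth, dimensions, and the simplified attention over concatenated high-resolution features are assumptions for illustration only, not the actual FlexAttention implementation; the commented-out freezing lines mirror the "CLIP is frozen" answer above.

```python
import torch
import torch.nn as nn

# Assumed hyperparameters, for illustration only.
N_SA = 8           # first N_SA layers keep standard self-attention
NUM_LAYERS = 32    # assumed total number of LLM layers
DIM, HEADS = 4096, 32

class StandardSelfAttentionLayer(nn.Module):
    """Plain self-attention over the token sequence."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)

    def forward(self, tokens, high_res_feats=None):
        out, _ = self.attn(tokens, tokens, tokens)
        return out

class FlexAttentionLayer(nn.Module):
    """Simplified stand-in: tokens additionally attend to high-resolution
    image features as extra keys/values (the feature-selection step from
    the paper is omitted here)."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(DIM, HEADS, batch_first=True)

    def forward(self, tokens, high_res_feats):
        kv = torch.cat([tokens, high_res_feats], dim=1)
        out, _ = self.attn(tokens, kv, kv)
        return out

# First N_SA layers use standard self-attention; the rest use FlexAttention.
layers = nn.ModuleList(
    [StandardSelfAttentionLayer() if i < N_SA else FlexAttentionLayer()
     for i in range(NUM_LAYERS)]
)

# The CLIP vision encoder stays frozen while the LLM is fully fine-tuned.
# `clip_vision_encoder` is a placeholder for the loaded vision backbone.
# for p in clip_vision_encoder.parameters():
#     p.requires_grad_(False)
```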