bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

Attention in pre-training and fine-tuning & pre-training code #148

Open riu83 opened 6 months ago

riu83 commented 6 months ago

Hi! Thank you for this amazing work! I've been reading the paper, code, and issues, and I'd just like to make sure I understand things correctly. This question is potentially connected to some other issues, at least #34 I think.

My understanding is that in the pre-training phase, you apply the attention mask described in figure S1, which ensures that masked genes are not allowed to attend to other masked genes. Then, fine-tuning using the GEP/GEPC objectives is similar to the pre-training process, but a key difference is that the attention mask is not applied, i.e. genes with masked expression values are allowed to attend to one another. Is this statement correct?
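For concreteness, here is a minimal sketch of the masking rule I think figure S1 describes, written in plain PyTorch. This is my own illustration of my reading, not code from flash_layers.py, and the function name is just mine:

```python
import torch

def masked_gene_attention_block(is_masked: torch.Tensor) -> torch.Tensor:
    """Illustrative only: my reading of the figure S1 rule, not code from the repo.

    is_masked: bool tensor of shape (seq_len,), True where a gene's expression value is masked.
    Returns a bool tensor of shape (seq_len, seq_len) where True means "query i may NOT attend
    to key j": a masked gene is blocked from attending to any *other* masked gene, but may
    still attend to itself and to all unmasked genes.
    """
    seq_len = is_masked.shape[0]
    blocked = is_masked.unsqueeze(1) & is_masked.unsqueeze(0)  # (masked query, masked key) pairs
    blocked &= ~torch.eye(seq_len, dtype=torch.bool)           # keep self-attention allowed
    return blocked

# toy example: genes 1 and 3 have masked expression values
print(masked_gene_attention_block(torch.tensor([False, True, False, True])).int())
```

And if my reading is right, during GEP/GEPC fine-tuning this blocked matrix would effectively be all False, since no attention mask is applied there.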

Following up on the implementation: my understanding is that the pre-training code currently lives only on the dev-temp branch, including the code that performs the attention masking in flash_layers.py. Do you plan to release a pre-training tutorial? I have difficulty understanding examples/pretrain.py on that branch, mostly because I don't understand the input data format. A small example starting from some standard adata would be super helpful. Many thanks!
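To be concrete about what I mean by "some standard adata": even a toy object like the one below would make a great starting point for a tutorial. It is entirely made up (cell/gene names and the filename are placeholders), just to illustrate what I would want to feed into the pipeline:

```python
# Purely illustrative: the kind of plain AnnData I would want a tutorial to start from.
import numpy as np
import pandas as pd
import anndata as ad

counts = np.random.poisson(1.0, size=(100, 2000)).astype(np.float32)  # cells x genes raw counts
adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame(index=[f"cell_{i}" for i in range(100)]),
    var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)]),
)
adata.write_h5ad("toy_pretrain_input.h5ad")  # how would examples/pretrain.py consume this?
```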

jkobject commented 5 months ago

Hi, also on a similar note: in dev-temp you show that you get the attention weights from flash-attn. However, as far back as I can go in their repo, these are always set to None for the FlashAttention class.

See here: https://github.com/Dao-AILab/flash-attention/blame/v0.2.2/flash_attn/flash_attention.py

Is there a way to get the attention weights? I'm happy to discuss further if needed :)
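In case it helps the discussion, the only workaround I have found so far is to recompute the weights myself from the projected q/k with a plain attention pass. This is my own sketch (not an API that flash-attn exposes), it assumes I can pull q and k out of the layer, and it obviously gives up FlashAttention's memory savings:

```python
import math
import torch

def recompute_attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: (batch, n_heads, seq_len, head_dim); returns (batch, n_heads, seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return scores.softmax(dim=-1)
```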

fscdc commented 1 month ago

I'm facing the same problem. Have you solved it?