bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

Question about the value of cls token #186

Open SELECT-FROM opened 5 months ago

SELECT-FROM commented 5 months ago

Thanks for your amazing work. I have some questions about the value of the cls token. During pretraining, the value of cls is pad_value (default -2), while during finetuning for integration, the value of cls is 0. Is there a special purpose to this design, given that the value of the cls token differs between the pretraining stage and the finetuning stage?

https://github.com/bowang-lab/scGPT/blob/4068d67caaac1e28d56964da68e0214817e38428/examples/pretrain.py#L430-L441 https://github.com/bowang-lab/scGPT/blob/706526a76d547de4ed711fa028c99be5bdf6ad8a/scgpt/tokenizer/gene_tokenizer.py#L298-L300
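To make the discrepancy concrete, here is a minimal sketch (not scGPT's actual API; `prepend_cls` is a hypothetical helper) of how the tokenizer prepends a value for the `<cls>` token at position 0, and how the two stages reported above would differ:

```python
import numpy as np

PAD_VALUE = -2.0  # scGPT's default pad_value, per the issue


def prepend_cls(values: np.ndarray, cls_value: float) -> np.ndarray:
    """Prepend the expression value assigned to the <cls> token (position 0).

    Hypothetical helper for illustration: the issue reports that
    pretraining uses cls_value == pad_value (-2), while the
    integration finetuning code uses cls_value == 0.
    """
    return np.concatenate([[cls_value], values])


expr = np.array([1.5, 0.0, 3.2])
pretrain_values = prepend_cls(expr, cls_value=PAD_VALUE)
finetune_values = prepend_cls(expr, cls_value=0.0)
```

The gene expressions themselves are identical in both cases; only the sentinel value sitting in the `<cls>` slot changes between the two stages.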

During finetuning for batch integration, the model is trained in a self-supervised manner. When the gene expression values are masked, the value of the cls token may also be masked; this situation does not occur during pretraining. I want to know why the value of the cls token can also be masked during batch-integration finetuning. What is the reason for this design? https://github.com/bowang-lab/scGPT/blob/706526a76d547de4ed711fa028c99be5bdf6ad8a/scgpt/tokenizer/gene_tokenizer.py#L467-L472 https://github.com/bowang-lab/scGPT/blob/4068d67caaac1e28d56964da68e0214817e38428/scgpt/data_collator.py#L417-L422
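A small sketch of the behavior being asked about (again hypothetical names, not the scGPT masking code itself): if the random value-masking samples positions over the whole sequence, index 0 (the `<cls>` slot) is a valid candidate unless it is explicitly excluded.

```python
import numpy as np

MASK_VALUE = -1.0  # hypothetical mask sentinel for illustration


def random_mask_values(values, mask_ratio, rng, protect_cls=False):
    """Randomly replace a fraction of values with MASK_VALUE.

    With protect_cls=False, position 0 (the <cls> slot) can be
    masked just like any gene position; with protect_cls=True it
    is excluded from the candidate indices.
    """
    values = values.copy()
    start = 1 if protect_cls else 0
    candidates = np.arange(start, len(values))
    n_mask = max(1, int(len(candidates) * mask_ratio))
    picked = rng.choice(candidates, size=n_mask, replace=False)
    values[picked] = MASK_VALUE
    return values


vals = np.array([0.0, 1.5, 0.0, 3.2, 2.1])  # index 0 is the <cls> value

# Without protection, some random draws will mask the <cls> slot.
cls_ever_masked = any(
    random_mask_values(vals, 0.4, np.random.default_rng(s))[0] == MASK_VALUE
    for s in range(100)
)

# With protection, the <cls> value is never touched.
protected = random_mask_values(vals, 0.4, np.random.default_rng(0), protect_cls=True)
```

This is only meant to illustrate the question: whether masking the `<cls>` value during batch-integration finetuning is intentional (e.g. harmless because the cls output is not read from the value head) or an oversight.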

noob000007 commented 2 months ago

Wow, I think they made a mistake? Maybe?