OmicsML / CellPLM

Official repo for CellPLM: Pre-training of Cell Language Model Beyond Single Cells.
BSD 2-Clause "Simplified" License

Was it necessary to have the gene embedder part? #6

MohammedZidane closed this issue 8 months ago

MohammedZidane commented 9 months ago

Hi, cool work! I really liked it and look forward to seeing it published. I would like to know whether the embedder part was really needed: could the model not work by taking the cell embeddings directly from the encoder? As far as I understand, SpaFormer did not need an embedder, and its tasks overlap with CellPLM's.

Thanks!

wehos commented 9 months ago

Hello,

Thank you for your interest. The Embedder module is designed to make the pre-trained model extensible. It has two major advantages: (1) It handles diverse input gene lists. If a downstream dataset covers only part of the pre-training gene set, the embedder automatically aligns it with the pre-training gene set, with no need to explicitly pad zeros in the data file. (2) We expect to provide advanced features such as adding new genes during fine-tuning. The embedder makes this possible, although we have not implemented it yet.
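To illustrate point (1), here is a minimal sketch of aligning a dataset's gene list with a pre-training gene set by index lookup instead of zero-padding. It is not the CellPLM implementation: the gene names, `PRETRAIN_GENES`, and `align_to_pretrain` are hypothetical, and dropping genes that are absent from the pre-training set is an assumption made for the example.

```python
# Hypothetical pre-training gene vocabulary (placeholder names, not CellPLM's).
PRETRAIN_GENES = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
GENE_TO_INDEX = {gene: i for i, gene in enumerate(PRETRAIN_GENES)}


def align_to_pretrain(dataset_genes, expression):
    """Map one cell's expression values onto pre-training gene indices.

    Returns (indices, values) covering only the genes found in the
    pre-training vocabulary. Genes the dataset never measured are simply
    absent from the result, so no zero-padded columns are needed.
    Assumption: genes unknown to the vocabulary are dropped.
    """
    if len(dataset_genes) != len(expression):
        raise ValueError("dataset_genes and expression must have equal length")
    indices, values = [], []
    for gene, value in zip(dataset_genes, expression):
        idx = GENE_TO_INDEX.get(gene)
        if idx is not None:
            indices.append(idx)
            values.append(value)
    return indices, values


# A dataset that covers only part of the vocabulary, in a different order,
# plus one gene ("GENE_X") that the vocabulary does not contain.
indices, values = align_to_pretrain(["GENE_D", "GENE_B", "GENE_X"], [2.0, 5.0, 1.0])
```

With a lookup like this, the data file can keep its own gene order and subset, and the alignment happens when the data is loaded.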

I hope this addresses your concern.

Best, Hongzhi

MohammedZidane commented 8 months ago

Hi Hongzhi, thanks for your reply. I see the point of the embedder's function, but could this functionality not be part of the encoder? I do not have much experience with large language models, but the few models I have seen do not have an embedder, which is why I am wondering.

Thanks

wehos commented 8 months ago

You may consider it a substitute for a tokenizer. In our case, each cell is naturally a token, so we called it an embedder instead. We may align the naming in the future. Thank you for your suggestion!
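The tokenizer analogy can be sketched as follows: each cell becomes one token vector, and a batch of cells becomes the sequence that an encoder would consume. This is an illustration under stated assumptions, not the CellPLM code: the tiny embedding table, `embed_cell`, and `embed_cells` are hypothetical, and the expression-weighted sum of gene embeddings is one plausible way to build a cell token.

```python
# Hypothetical gene embedding table with 2-dimensional vectors (toy values).
GENE_EMBEDDINGS = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [0.0, 1.0],
    "GENE_C": [1.0, 1.0],
}
EMBED_DIM = 2


def embed_cell(gene_expression):
    """Turn one cell (a dict of gene name -> expression) into one token vector.

    Assumption: the token is the expression-weighted sum of gene embeddings,
    and genes missing from the table are skipped.
    """
    token = [0.0] * EMBED_DIM
    for gene, value in gene_expression.items():
        row = GENE_EMBEDDINGS.get(gene)
        if row is None:
            continue
        for k in range(EMBED_DIM):
            token[k] += value * row[k]
    return token


def embed_cells(cells):
    """Embed a list of cells; the result has one token per cell."""
    return [embed_cell(cell) for cell in cells]


cells = [
    {"GENE_A": 2.0, "GENE_C": 1.0},
    {"GENE_B": 3.0, "GENE_X": 9.0},  # GENE_X is not in the table
]
tokens = embed_cells(cells)
```

Here the sequence length equals the number of cells, which is the sense in which a cell plays the role that a word piece plays in a text model.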

MohammedZidane commented 8 months ago

Got it! Thanks, Hongzhi :)