bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

Non-zero genes #110

Closed: yuzhenmao closed this issue 11 months ago

yuzhenmao commented 11 months ago

Hi,

Thank you for the amazing work! I noticed that you mentioned "... use all non-zero expressed genes as input for pre-training". May I ask which piece of code in the repo implements this, i.e., removes all the zero-expressed genes before the non-zero genes are fed into the model? I can only find https://github.com/bowang-lab/scGPT/blob/dev-temp/scgpt/preprocess.py#L258, which does something similar but still returns the whole gene list, including zero-expressed genes. Please correct me if I am wrong.

Thank you so much!

subercui commented 11 months ago

Hi, thank you for your question. Removing the zero-expressed genes is actually done during data preparation. We converted the original data into an efficient custom data structure, and during that conversion we retained only the genes with non-zero expression. For example, you can find the data conversion here: https://github.com/bowang-lab/scGPT/blob/dev-temp/data/cellxgene/build_large_scale_data.py#L193-L201
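
To make the idea concrete, here is a minimal sketch of keeping only the non-zero genes for each cell. It is not the code from the repository: the function name `keep_nonzero_genes`, the record keys `genes` and `expressions`, and the toy data are all hypothetical, and the real conversion works on a different, more efficient data structure.

```python
from typing import Dict, List, Sequence


def keep_nonzero_genes(
    counts: Sequence[Sequence[float]],
    gene_ids: Sequence[int],
) -> List[Dict[str, list]]:
    """Turn a dense cell-by-gene matrix into per-cell records of non-zero genes.

    Hypothetical helper for illustration only; not part of scGPT.
    """
    records = []
    for row in counts:
        if len(row) != len(gene_ids):
            raise ValueError("each row must have one value per gene")
        # Column indices of the genes that are expressed in this cell.
        nonzero_idx = [j for j, value in enumerate(row) if value != 0]
        records.append(
            {
                "genes": [gene_ids[j] for j in nonzero_idx],
                "expressions": [row[j] for j in nonzero_idx],
            }
        )
    return records


# Toy data: 2 cells x 4 genes, with made-up gene ids.
counts = [
    [0, 3, 0, 1],
    [2, 0, 0, 0],
]
gene_ids = [10, 11, 12, 13]
records = keep_nonzero_genes(counts, gene_ids)
print(records)
```

Each cell ends up with a variable-length list of gene ids and a matching list of expression values, so the zero entries never reach the model input.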

yuzhenmao commented 11 months ago

Thank you so much for the prompt reply! I see. Just to make sure, is this implemented in: https://github.com/bowang-lab/scGPT/blob/dev-temp/scgpt/scbank/databank.py#L758? Thanks!

subercui commented 11 months ago

Yes, exactly that place.