OmicsML / CellPLM

Official repo for CellPLM: Pre-training of Cell Language Model Beyond Single Cells.
BSD 2-Clause "Simplified" License

Training the model from scratch #9

Open MohammedZidane opened 6 months ago

MohammedZidane commented 6 months ago

Hi, does the code have the option to train the model from scratch? I am learning about foundation models and would like to monitor the training process itself. I could not figure out whether the code allows training from scratch.

Could you let me know if this option exists?

Thanks!

wehos commented 6 months ago

Hello, thanks for your interest.

We have not released the pretraining code yet; however, the loss functions are preserved in the code base. In the forward pass here, the loss is returned. You may deploy an optimizer on these losses.
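As a rough sketch (not the official pretraining script), the idea is simply to back-propagate the loss that the forward pass returns. The dummy model, toy data, and the 'loss' key below are placeholders to swap for the actual CellPLM model, dataloader, and whatever keys its forward returns:

```python
# Minimal sketch of "deploying an optimizer on the returned losses".
# DummyModel stands in for the CellPLM model; its forward() returns a dict
# containing a 'loss' entry, mirroring the pattern described above.
import torch
import torch.nn as nn

class DummyModel(nn.Module):
    """Stand-in model whose forward pass returns its training loss."""
    def __init__(self, n_genes=407, n_hidden=64):
        super().__init__()
        self.encoder = nn.Linear(n_genes, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_genes)

    def forward(self, x):
        recon = self.decoder(torch.relu(self.encoder(x)))
        return {'loss': nn.functional.mse_loss(recon, x)}  # toy reconstruction loss

model = DummyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
dataloader = [torch.randn(32, 407) for _ in range(5)]  # toy batches of 32 cells x 407 genes

model.train()
for epoch in range(3):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)['loss']   # the forward pass hands back the loss
        loss.backward()               # back-propagate through it
        optimizer.step()
```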

Feel free to discuss here if you encounter any specific issues.

Best, Hongzhi

MohammedZidane commented 5 months ago

Thank you so much, Hongzhi. I really appreciate how responsive you are.

As I mentioned, I am learning more about foundation models. I noticed that you mask part of x_seq even during a downstream task like cell type annotation. If that is true, I don't understand why. Shouldn't the masking apply only to the SSL objective used for pretraining?

Thanks

wehos commented 5 months ago

Thanks for your question. When the downstream objective is cell type annotation, the masking effectively acts like input dropout. In many deep learning implementations, input dropout is treated as a simple form of data augmentation, and its ratio may differ from the hidden dropout ratio. This technique generally works well (here is an example).
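For illustration only, masking a fraction of the input expression values looks roughly like the sketch below; the function name and the 20% mask ratio are assumptions for the example, not CellPLM's actual settings:

```python
# Input masking as data augmentation: randomly zero a fraction of input
# genes per cell during training, analogous to dropout on the input layer.
import torch

def mask_input(x, mask_ratio=0.2, training=True):
    """Zero out a random subset of input features (genes) for each cell."""
    if not training or mask_ratio == 0.0:
        return x
    keep = (torch.rand_like(x) >= mask_ratio).float()  # 1 = keep, 0 = mask
    return x * keep

x_seq = torch.randn(8, 407)        # toy batch: 8 cells x 407 genes
x_aug = mask_input(x_seq, 0.2)     # roughly 20% of entries zeroed out
```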

That said, feel free to remove it if it hurts performance!

MohammedZidane commented 5 months ago

got it! Thank you so much :)

MohammedZidane commented 5 months ago

Hi Hongzhi,

You suggested earlier that I could deploy optimizers in the cellformer.py file to pretrain the model, which makes sense. But isn't the imputation.py file closer to an SSL implementation? In other words, is it possible to use that file for the pretraining?

Thanks

MohammedZidane commented 4 months ago

Hi Hongzhi, I have one more question. In the imputation downstream task, in the zinb.py file in the objective folder:

The input data in the notebook has 407 genes, and you use x_dict['input_gene_mask'] to select those 407 genes from the 19374 pretraining genes; from those you then get 307 genes whose values will be predicted. Is that right?

If my understanding is correct, I still can't work out the meaning of the 'mean' values obtained from the other zinb.py in the decoder folder. The 'mean' has 19374 values, which are filtered down to 407 and then reduced to 307. I can't see what these values represent; I can only tell that the 307 values could be the predicted values for the imputation task.
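To make the question concrete, here is a rough sketch of the filtering I have in mind; the tensor names and the way the masks are built are just placeholders for illustration, not the actual code:

```python
# Sketch of the two-step gene filtering described above: the decoder outputs
# a 'mean' for all 19374 pretraining genes, a boolean input gene mask keeps
# the 407 genes present in the dataset, and a second mask keeps the 307
# genes whose values are predicted for imputation.
import torch

n_pretrain, n_dataset, n_target = 19374, 407, 307

mean_all = torch.rand(16, n_pretrain)              # decoder 'mean' for every pretraining gene

input_gene_mask = torch.zeros(n_pretrain, dtype=torch.bool)
input_gene_mask[:n_dataset] = True                 # which of the 19374 genes exist in this dataset

target_gene_mask = torch.zeros(n_dataset, dtype=torch.bool)
target_gene_mask[:n_target] = True                 # which of the 407 genes are to be imputed

mean_dataset = mean_all[:, input_gene_mask]        # shape (16, 407): restrict to dataset genes
mean_target = mean_dataset[:, target_gene_mask]    # shape (16, 307): predictions used for imputation
```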

Thanks

wehos commented 4 months ago

Hi Mohammed.

I would love to help but I'm traveling for a few conferences these days. I'll get back to you as soon as I am available.

Best, Hongzhi

MohammedZidane commented 4 months ago

Thank you so much for your reply. Good luck :)