NikicaJEa opened 6 months ago
I am also interested in this
@NikicaJEa, if you only select a subset of genes for training, SGD is not necessary - you can safely turn it off to enable a canonical logistic regression.
Thanks for your reply @ChuanXu1. I experimented with every combination I could think of: with/without SGD, with/without mini-batching, with/without balance_cell_type, training without subsetting genes, and adding a feature selection step. Unfortunately, none of these combinations yielded results comparable to the original Immune_All_Low model. I understand there is always some degree of randomness to be expected, but this is more than I would expect. It would be helpful to know the exact parameters under which the original model was trained.
@NikicaJEa, to produce a model with performance comparable to the built-in models, you can use the same set of genes (which you had already done, plus setting check_expression = False) and increase the number of iterations (for example, max_iter = 1000), with all other parameters left at their defaults.
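Put together, the suggested training call might look like the sketch below. The file path, the label column name ("cell_type"), and the output filename are placeholders, not the exact values used to train the built-in model; the gene set is taken from the built-in model itself via its features attribute.

```python
import scanpy as sc
import celltypist

# Load the provided reference counts (path is a placeholder).
adata = sc.read("CellTypist_Immune_Reference_v2_count.h5ad")

# CellTypist expects log1p-transformed counts normalized to 10,000 per cell.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Restrict training to the built-in model's gene set, skip the expression
# check, and raise the iteration cap; leave everything else at the defaults
# (i.e. no SGD, no mini-batching, no feature selection).
ref_model = celltypist.models.Model.load("Immune_All_Low.pkl")
new_model = celltypist.train(
    adata[:, ref_model.features],   # same genes as the built-in model
    labels="cell_type",             # placeholder: your cell-type obs column
    check_expression=False,
    max_iter=1000,
)
new_model.write("Immune_All_Low_replicate.pkl")
```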
Hi, I was trying to replicate your Immune_All_Low model using the same training dataset that you kindly provided (CellTypist_Immune_Reference_v2_count). I tested the two models (the original and the replicate) on an independent dataset. The final annotations differ quite a bit; in particular, I notice that the prediction scores from the original model are substantially higher (spanning 0-1) than those of the replicate (mostly around 0). Here is the model training code:
I also tried normalizing after the gene subsetting to see if it would make a difference, but nothing changed much. Thanks for the help!
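Prediction scores clustered near zero are what an under-converged logistic regression produces: with few solver iterations the coefficients stay small and the per-class probabilities stay near chance. A small sketch of this effect using plain scikit-learn (which CellTypist builds on) and synthetic data, not the actual reference dataset:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated multi-class data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    # Stop the solver after a single iteration: an under-trained model.
    underfit = LogisticRegression(max_iter=1).fit(X, y)
# Let the solver run to convergence.
converged = LogisticRegression(max_iter=1000).fit(X, y)

# Mean top-class probability: the converged model is far more confident,
# while the under-trained one sits much closer to chance (0.25 here).
p_under = underfit.predict_proba(X).max(axis=1).mean()
p_conv = converged.predict_proba(X).max(axis=1).mean()
print(f"underfit: {p_under:.2f}  converged: {p_conv:.2f}")
```

This is consistent with the maintainer's advice that raising max_iter (rather than switching training modes) is what brings the replicate's scores in line with the built-in model.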