Open ManuelSokolov opened 3 months ago
@ManuelSokolov, the training process involves various sources of randomness. For example, the first round of training uses SGD, which shuffles the data before each epoch and therefore introduces randomness. If you want a stable model, a better approach is to increase the number of iterations during training (e.g., `max_iter = 2000`), at the cost of a longer runtime.
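For example, something like this (a minimal sketch; `adata` and the `cell_type` label column are placeholders for your own reference data):

```python
import celltypist

# Train with more solver iterations so the fit converges more tightly;
# this trades a longer runtime for more reproducible coefficients.
model = celltypist.train(
    adata,                   # reference AnnData object (placeholder)
    labels="cell_type",      # .obs column holding the labels (assumed name)
    feature_selection=True,  # keep feature selection on
    max_iter=2000,           # raised from the default to stabilize training
)
model.write("reference_model.pkl")
```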
@ChuanXu1 thank you for your response. The `use_SGD` flag is set to `False` by default, so that source of randomness should not exist. Is there any other reason that could be driving this randomness? Disabling feature selection during training seems to have removed the randomness from the model. Also, my goal, in addition to stability, is to obtain correct results; a model that classifies wrongly with high confidence scores is not helpful in this case (the UMAP below shows the result of one iteration).
If I disable feature selection, the result is always the same:
However, since the results with and without feature selection seem to be completely different, I am not sure whether I can trust the model. Could you please comment on this?
@ManuelSokolov, the first round of training always uses SGD. `use_SGD = False` (the default) only applies to the second round of training, after feature selection.
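In other words, roughly (a simplified sketch of the two-round behaviour, not the library's internal code; `adata` and the label column are placeholders):

```python
import celltypist

# Round 1 (always SGD when feature_selection=True): a quick SGD-based fit
# ranks the genes, and only the top genes are kept.
# Round 2 (controlled by use_SGD): the final classifier is refit on the
# selected genes; with use_SGD=False the standard solver is used here,
# so only round 1 carries the data-shuffling randomness.
model = celltypist.train(
    adata,
    labels="cell_type",      # assumed .obs column name
    feature_selection=True,  # triggers the SGD-based round 1
    top_genes=300,           # genes kept after round 1 (the default)
    use_SGD=False,           # affects round 2 only
)
```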
Sorry @ChuanXu1, you seem to have responded before I edited my message. Disabling feature selection seems to have stabilized the results, but it is difficult to know which output is right or wrong; please see the message above.
@ManuelSokolov, it is usually recommended to use feature selection, as it speeds up the run and increases accuracy.
In this case it seems to reduce accuracy by producing different results across iterations. I also looked into the `annotate` method: it applies standard scaling before classification, and this option cannot be set to `False`. What is your recommendation given this example?
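For context, this is roughly how I measured the instability (a sketch; `adata_ref` and `adata_query` are placeholders for my reference and query data):

```python
import celltypist
import pandas as pd

# Train several models on the same reference, annotate the same query,
# and compare the per-cell predictions across runs.
predictions = []
for i in range(5):
    model = celltypist.train(adata_ref, labels="cell_type", feature_selection=True)
    result = celltypist.annotate(adata_query, model=model)
    predictions.append(result.predicted_labels["predicted_labels"])

# Fraction of cells whose label differs between at least two runs.
labels = pd.concat(predictions, axis=1)
unstable = (labels.nunique(axis=1) > 1).mean()
print(f"{unstable:.1%} of cells change label across runs")
```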
Hi! I am doing label transfer from a reference dataset and classifying two query sets that should contain exactly the same cell types. I noticed that when running the classification several times, the results differ between iterations.
As you can see in the next plot, for each sample (rows) I plotted the percentage of predicted cell types per sample (e.g., the first sample in the graph was classified as radial glia in 40% of the 25 iterations and as glioblast in 60%).
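(This is roughly how the percentages were tallied; a self-contained sketch with toy data standing in for the collected results:)

```python
import pandas as pd

# Toy stand-in for the collected results: one row per (iteration, sample)
# with the cell type predicted for that sample in that iteration.
runs = pd.DataFrame({
    "iteration": [0, 0, 1, 1, 2, 2],
    "sample":    ["s1", "s2", "s1", "s2", "s1", "s2"],
    "predicted": ["radial glia", "glioblast", "glioblast", "glioblast",
                  "radial glia", "glioblast"],
})

# For each sample, the percentage of iterations assigning each cell type.
pct = (
    runs.groupby("sample")["predicted"]
    .value_counts(normalize=True)
    .mul(100)
    .rename("percent")
)
print(pct)
```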
Is this behaviour expected/documented for CellTypist? What is recommended to do in this case?
Best Regards,
Manuel