Closed Hanminghao closed 1 month ago
After I changed to distributed training, I got results similar to the manuscript.
Hi, @Hanminghao ,
I encountered the same issue as you: low correlation values while training on an A6000. Have you been able to identify the cause? Any insights would be appreciated.
Why does distributed training have such a significant impact? I also noticed that the contrastive loss differs from methods like MoCo-v3 in that it doesn't gather samples from other GPUs before computing the loss. Could you also share the per-GPU batch size you used?
I'm sorry to hear that, but in my many tests I was able to get good results without distributed training. Specifically, I set the batch size to 128 and obtained the following when training on a single A6000: HVG: 10.27, HEG: 18.97. Note that I did not test on only a single slide as in the original paper; I tested on four slides separately and averaged the results.
Thank you for your response. I'm now getting reasonable results using a batch size of 128 with gradient accumulation of 4. This setup mimics bsz=512 in distributed training by averaging the loss across micro-batches instead of concatenating logits before computing the loss.
I’m quite surprised by how sensitive this type of model is to hyperparameters like batch size.
Model | Mean Correlation (Cells) | Max Correlation | Mean HEG | Mean HVG | Mean Markers |
---|---|---|---|---|---|
BLEEP (bsz=128, accum=4) | 0.8025 | 0.6810 | 0.1630 | 0.1657 | 0.2280 |
BLEEP (bsz=128, accum=1) | 0.7149 | 0.6282 | 0.1096 | 0.0988 | 0.1158 |
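For anyone curious why accumulation only *mimics* a large batch here: a minimal numpy sketch (not the repo's actual code, and `info_nce` is my own hypothetical helper) of a symmetric CLIP-style contrastive loss. Averaging the loss over 4 micro-batches of 128 gives each sample only 127 in-batch negatives, whereas a true bsz=512 batch (or gathering logits across GPUs) gives 511, so the two losses differ:

```python
import numpy as np

def info_nce(img, txt, temperature=1.0):
    """Symmetric CLIP-style contrastive loss; positives are on the diagonal."""
    logits = (img @ txt.T) / temperature        # (B, B) similarity matrix
    idx = np.arange(len(img))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()           # cross-entropy toward the diagonal
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
B, D, accum = 512, 32, 4
img = rng.standard_normal((B, D)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.standard_normal((B, D)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)

full_loss = float(info_nce(img, txt))           # 511 in-batch negatives per sample
micro = [float(info_nce(img[k::accum], txt[k::accum])) for k in range(accum)]
accum_loss = float(np.mean(micro))              # only 127 negatives per micro-batch

print(full_loss, accum_loss)                    # accum_loss is smaller: fewer negatives
```

For random unit embeddings the loss sits near log(B), so the micro-batch average lands near log(128) rather than log(512); the gradient signal (number of hard negatives) differs accordingly, which may be part of why the batch-size sensitivity shows up.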
Hello author, first of all thank you for your outstanding work. I got poor test-set results while experimenting entirely with your code and data. Specifically, the model runs on a single A6000 48GB GPU with the batch size set to 512 and the learning rate to 1e-3; apart from not using distributed training, my setup is the same as yours. The best model selected on the validation set appears at epoch 38, with a val loss of 3.38. However, the model only reaches around 0.02 and 0.03 on the mean correlation of highly expressed genes and the mean correlation of highly variable genes. I look forward to your reply. Thank you.