bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License

Performance of downstream tasks could not be reproduced #64

Closed hzlllll closed 1 year ago

hzlllll commented 1 year ago

Hello, thanks for your great work! I'm interested in the work, but I can't get similar results when reproducing the batch integration and perturbation prediction tasks using the tutorial code and pretrained checkpoints. There still seems to be some gap from the best results in the preprint. Are there any tricks or matters needing attention for the downstream tasks?

subercui commented 1 year ago

Hi, thank you for the question. We did intend for the notebooks to run smoothly as they are, and no additional tricks should be needed after a proper installation. The tutorial notebooks were tested before we updated them. We tried to set random seeds in most places, though there is still some uncontrollable randomness introduced by different software versions and even CPU or GPU specs. Overall, you should get the same or similar results as shown by running the notebooks.
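
For reference, a minimal sketch of that kind of seeding (illustrative only; the exact set of seeds the notebooks control may differ):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness (illustrative, not scGPT's exact code)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels reduce, but cannot fully remove,
    # run-to-run variation across GPU models and library versions.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```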

So, could you point me to which notebook you are running, and do you mind sharing some of the results you have now, if possible? If you are running multiple notebooks, it may be easier to show one of them first. I would guess it could be some environment settings. I'll try to see what is going on and help with reproducing the results in the notebook.

hzlllll commented 1 year ago

Thanks for your reply! The integration result I produced is Avg_batch 0.984 (higher than the preprint) and Avg_bio 0.757 (lower than the preprint). I also found that a lower test loss corresponds to a higher Avg_batch but not a higher Avg_bio. I wonder which part of the loss is mainly related to Avg_bio. ECS seems related, but after removing it the Avg_bio is still near 0.75.
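
(Schematically, the ablation above amounts to zeroing the ECS term's weight in a combined objective like the sketch below; mlm_loss, ecs_loss, and ecs_weight are placeholder names, not scGPT's actual variables.)

```python
import torch

# Placeholder names; scGPT's real loss terms and weights differ.
def total_loss(mlm_loss: torch.Tensor, ecs_loss: torch.Tensor,
               ecs_weight: float = 10.0) -> torch.Tensor:
    # Setting ecs_weight to 0.0 is equivalent to removing the
    # ECS (elastic cell similarity) term from training.
    return mlm_loss + ecs_weight * ecs_loss
```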

subercui commented 1 year ago

Hi, I see. For that experiment in the tutorial, you would normally see avg_bio above 0.81. Please see these two runs I recently reproduced: https://wandb.ai/scformer-team/scGPT-public/runs/3lbdx364, https://wandb.ai/scformer-team/scGPT-public/runs/3b7pxf1g. On the other hand, the evaluation metrics for integration can fluctuate a bit, since metrics like NMI and ARI are sensitive to the recognized clusters and may occasionally give different scores. I would suggest (1) trying different random seeds, and (2) checking the UMAP: as long as the clusters are visually well separated and align well with the cell type annotations, the workflow is working properly.
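
To illustrate the seed sensitivity, a sketch that re-clusters the same embeddings under different seeds and scores each run against the annotations (it assumes the cell embeddings live in adata.obsm["X_scGPT"] and the labels in adata.obs["cell_type"]; both names are assumptions):

```python
import scanpy as sc
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Assumes `adata` holds fine-tuned cell embeddings in obsm["X_scGPT"]
# and ground-truth annotations in obs["cell_type"] (names are assumptions).
sc.pp.neighbors(adata, use_rep="X_scGPT")
labels = adata.obs["cell_type"]
for seed in (0, 1, 42):
    sc.tl.leiden(adata, key_added=f"leiden_{seed}", random_state=seed)
    clusters = adata.obs[f"leiden_{seed}"]
    print(seed,
          round(normalized_mutual_info_score(labels, clusters), 3),
          round(adjusted_rand_score(labels, clusters), 3))
```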

hzlllll commented 1 year ago

Thanks for your reply!

wconnell commented 1 year ago

I'm having trouble reproducing Tutorial_Integration.ipynb; the only change I made to the code was disabling flash attention (I couldn't get it to install). I've run the notebook twice and got the following results:

[Screenshot: integration metrics from the two runs]

This is with the scGPT_human pretrained model available for download on Google Drive. I might try the examples/finetune_integration.py script next.

subercui commented 1 year ago

Hi @wconnell, currently you will need flash attention to load the pretrained model weights correctly. This is because the parameter names differ from those of the naive PyTorch MHA. We are working on it and will support running with naive PyTorch in the coming days. Please have a look at #39 as well; I will mark the todo items there once the updates are complete.
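
Until that update lands, a possible workaround is to rename the checkpoint's parameter keys before loading. The substring mapping below is purely hypothetical; the real flash-attn vs. naive-MHA key names need to be checked against the actual checkpoint and model:

```python
import torch

# `model` is the instantiated scGPT model without flash attention (assumed defined).
state_dict = torch.load("best_model.pt", map_location="cpu")

# Hypothetical renaming; inspect state_dict.keys() and model.state_dict().keys()
# first to find the real correspondence between parameter names.
renamed = {k.replace("Wqkv", "in_proj"): v for k, v in state_dict.items()}

missing, unexpected = model.load_state_dict(renamed, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```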

wconnell commented 1 year ago

Using flash attention (the only code change was reducing the batch size), I'm still having trouble reproducing:

seed=42

[Screenshot: metrics with seed=42]

seed=41

[Screenshot: metrics with seed=41]

subercui commented 1 year ago

Hi @wconnell, what is the batch size you currently use? Choosing an appropriate batch size can be important for fine-tuning; I would recommend using the default value as provided if memory allows. On the other hand, you may also have a look at the UMAP results, and see the comment here as well: https://github.com/bowang-lab/scGPT/issues/64#issuecomment-1683413837

wconnell commented 1 year ago

I used a slightly smaller batch size, 56 for the first run and 48 for the second run.

Looking back at these runs, the first model in fact achieved the reported avg_bio about halfway through training. 👏

[Screenshot: avg_bio over training epochs]

The loss (I think val_loss is used to select the best model) does not seem to be well correlated with avg_bio performance, though:

[Screenshot: loss and avg_bio training curves]

There seems to be a bit of brittleness in the fine-tuning scenario, and I'm wondering how it can be mitigated 🤔 Going to keep benchmarking; hopefully this is helpful!

wconnell commented 1 year ago

I think it's also quite important to point out the metrics computed on the PCs of the preprocessed data:

[Screenshot: metrics on the PCs of the preprocessed data]
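
For reference, a sketch of how such a PC baseline could be computed with scanpy before scoring it with the same metrics (the preprocessing parameters are generic assumptions, not necessarily the tutorial's exact settings):

```python
import scanpy as sc

# Generic preprocessing; the tutorial's exact parameters may differ.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.tl.pca(adata, n_comps=50)
# adata.obsm["X_pca"] can now be scored with the same avg_bio / avg_batch
# metrics used for the scGPT embeddings.
```
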
subercui commented 1 year ago

Hi @wconnell, regarding your last two comments:

> I used a slightly smaller batch size, 56 for the first run and 48 for the second run.

> Looking back at these runs, the first model in fact achieved the reported avg_bio about halfway through training. 👏

For a self-supervised learning scenario like this, the val_loss indicates how well the model can reconstruct the data. So I think, in general, it reflects how good the learned cell embeddings are, and we thus used this score to select the best model. If you look at the logged UMAPs at epochs 10 and 15 in wandb, I would guess they look similar for this data, and the UMAP at epoch 15 may show more subtle structures like subclusters. On the other hand, the reported scores like avg_bio are computed by matching the results with human labels, so even though the subtle structures may actually reveal more biological signal, such as sub-cell-type clusters, they may lead to lower scores.
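
A schematic of that selection rule (train_one_epoch, evaluate, and n_epochs are placeholders, not scGPT's actual functions):

```python
import copy

best_val_loss = float("inf")
best_state = None
for epoch in range(n_epochs):
    train_one_epoch(model)      # placeholder training step
    val_loss = evaluate(model)  # reconstruction loss on held-out cells
    # The best model is the one with the lowest reconstruction loss,
    # not the highest avg_bio; the two can diverge, as observed above.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
```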

This is related to the points made in the comment https://github.com/bowang-lab/scGPT/issues/64#issuecomment-1716701492. I think proper evaluation for data integration is still a challenge in the field. Therefore, we listed both the scores and all the UMAPs when doing the comparison in the manuscript; a combined view of both should give a better evaluation. I would strongly recommend looking at the UMAP as well: as long as the clusters are visually well separated and align well with the cell type annotations, it should indicate a proper integration result.
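
A minimal sketch of that visual check (the embedding key and column names are assumptions):

```python
import scanpy as sc

# Assumes embeddings in obsm["X_scGPT"] and obs columns "cell_type" / "batch".
sc.pp.neighbors(adata, use_rep="X_scGPT")
sc.tl.umap(adata)
# Well-separated clusters aligning with cell_type, with batches mixed
# within each cluster, indicate a proper integration.
sc.pl.umap(adata, color=["cell_type", "batch"], wspace=0.4)
```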

Regarding the PC metrics: it is true that this tutorial data is rather small and of high sequencing quality, so PCs may also work fine, since there is little batch effect in the data. Scores on this data from other integration methods, such as scVI and Harmony, are also listed in the online manuscript. In general, the value of integration methods is better demonstrated on more challenging datasets with complex cell types and batch effects. We ran experiments on a diverse range of datasets, and you can find the results in the manuscript's supplementary figures. We also plan to release a separate repo for generating the results on these other datasets soon.

I hope these explanations make sense to you.

wconnell commented 1 year ago

> I think a proper evaluation for data integration is still a challenge in the field

I heartily agree, and successful application to downstream tasks is really the best proof, IMO.

> subtle structures may actually reveal more biological signals such as subcelltype clusters

I concur here, and I wonder if there is a good experiment in biasing some fraction of cells in a cell type cluster (with biological/technical noise) and observing whether the model corrects it. I expect it would!

To be clear, I think the results are strong and supported by good evidence. Thanks for engaging ✅

subercui commented 1 year ago

Thank you and thanks for the great suggestions!