campbio / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data
http://bioconductor.org/packages/celda
MIT License

decontX not reproducible? #377

Closed: trnguyenuka closed this issue 2 years ago

trnguyenuka commented 2 years ago

Hello campbio,

Thank you for developing a great tool. I have been using it to estimate contamination in my scRNA-seq projects for a while. Recently, when I re-ran an analysis pipeline on an old dataset, I noticed a big difference in the estimated contamination levels output by decontX. Since I apply the filter "AmbientRNA < 0.5", this difference changes the results of all downstream steps...

I'm quite sure that I changed neither my code nor the input data, so I'm wondering where the problem could come from. I see that the function "decontX" has an input argument "seed"; does that mean "decontX" relies on a stochastic algorithm?

Moreover, do different versions of "celda" heavily affect the results? I made a silly mistake: I didn't keep a record of the sessionInfo() output from the original run, so now I cannot recall which version I used back then ... :(

Thank you very much and I'm looking forward to your reply. Best regards,

H.N

joshua-d-campbell commented 2 years ago

Hi @trnguyenuka, thanks for using our tool! I am sorry about the extra hassle related to reproducibility. I don't think the underlying algorithm or the random initialization for the variational inference has changed. However, the initialization related to the clustering has been updated, so the initial cluster labels that decontX uses may be somewhat different in newer versions. We rely on functions from the scater and scuttle packages for the initial clustering, so if things changed in newer versions of those packages, this may also change the output of decontX somewhat.

If you can get the original decontX cluster labels and supply them to the z parameter in a new decontX run, then I think you should get the same results (but let me know if that is not the case). Or, if you know the R version you used, you can try installing decontX and its dependencies from the corresponding Bioconductor version.
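A minimal sketch of the re-run with saved labels, assuming the earlier result is still available as a SingleCellExperiment and that its labels sit in the `decontX_clusters` column of `colData`. The helper name `rerun_decontx` and the object names `sce_old` and `sce` are hypothetical and only for illustration.

```r
# Hypothetical helper: rerun decontX with cluster labels saved from an earlier run.
# Assumes `sce` is a SingleCellExperiment and `old_z` holds one label per cell,
# in the same cell order as the columns of `sce`.
rerun_decontx <- function(sce, old_z, seed = 12345) {
  # Fail early if the labels do not line up with the cells.
  stopifnot(length(old_z) == ncol(sce))
  # Passing z makes decontX use these labels instead of clustering the cells itself.
  celda::decontX(sce, z = old_z, seed = seed)
}

# Usage (not run here; requires the celda package and your own objects):
# old_z   <- SummarizedExperiment::colData(sce_old)$decontX_clusters
# sce_new <- rerun_decontx(sce, old_z)
# all.equal(SummarizedExperiment::colData(sce_new)$decontX_contamination,
#           SummarizedExperiment::colData(sce_old)$decontX_contamination)
```

If the two contamination vectors still differ with identical labels, that would point to something other than the initial clustering.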

If the initial clustering is similar, then the decontX results should be similar. I wonder if you can see if the major cell types are being correctly clustered in your newer decontX runs.

You bring up a good point about reproducibility: it is generally a good idea to put a sessionInfo() call at the end of all your scripts. In the future, we are thinking of ways to automatically include the version information in the results stored in the SCE object in case people forget to do this.
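One way to follow that tip is to write the session information to a file next to the results. This is a sketch using base R only; the helper name `record_session`, the file name, and the list of packages are assumptions chosen for illustration.

```r
# Hypothetical helper: write sessionInfo() plus selected package versions to a text file.
record_session <- function(path = "sessionInfo.txt", pkgs = character()) {
  lines <- capture.output(sessionInfo())
  for (p in pkgs) {
    # packageVersion() throws an error for packages that are not installed.
    v <- tryCatch(as.character(utils::packageVersion(p)),
                  error = function(e) "not installed")
    lines <- c(lines, sprintf("%s: %s", p, v))
  }
  writeLines(lines, path)
  invisible(path)
}

# At the end of an analysis script:
record_session("sessionInfo.txt", pkgs = c("celda", "uwot", "scater", "scuttle"))
```

Listing the dependencies explicitly keeps their versions on record even if the script never attaches them directly.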

trnguyenuka commented 2 years ago

Hi Mr. Campbell,

Thank you so much for your prompt and detailed response. I have found the cause of my issue: it was not decontX itself, but the recent update of uwot. I guess decontX uses uwot to generate the UMAP, right? After downgrading uwot to 0.1.11, I got the same results as before.

Thank you again and all the best, H.N

joshua-d-campbell commented 2 years ago

Thanks so much @trnguyenuka! Yes, that is correct: we also make use of the uwot package, and underlying changes to it will also affect the clustering. I am going to move this to a Discussion thread so others can also see it.
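For anyone hitting the same difference, the downgrade described above can be sketched as follows. The guard function `require_version` is hypothetical, and installing through the remotes package is only one possible route.

```r
# Hypothetical guard: stop early if an installed package is not the pinned version.
require_version <- function(pkg, expected) {
  installed <- as.character(utils::packageVersion(pkg))
  if (installed != expected) {
    stop(sprintf("%s %s is installed, but %s is expected", pkg, installed, expected))
  }
  invisible(TRUE)
}

# One way to install the older release (not run here; needs the remotes package):
# remotes::install_version("uwot", version = "0.1.11")

# Then, at the top of the pipeline, before calling decontX:
# require_version("uwot", "0.1.11")
```

A check like this makes a silent dependency update fail loudly instead of shifting the contamination estimates.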