Open bschilder opened 3 years ago
Thanks for your interest in Cell BLAST and the detailed explanation!
Given the above information, I can think of two potential fixes:
One possibility is that the model has not fully converged. You may try increasing the number of epochs to larger values like 200 or 500. Training will stop early once the loss has converged.
In Cell BLAST we have a single tunable hyperparameter (denoted as lambda_b in the manuscript) that controls the strength of adversarial batch alignment. Larger values enforce stronger alignment, but also increase the risk of over-alignment. By default, we use lambda_b=0.01, which empirically produces good results in most cases.
In this case, it seems that lambda_b=0.01 is insufficient for aligning datasets this diverse. My recommendation is to try increasing this hyperparameter to values like 0.1 or 1.0. That should produce better dataset mixing.
The hyperparameter (named lambda_reg in the code) can be set like this:
model = cb.directi.fit_DIRECTi(
    combined_dataset,
    genes=var_genes_study,
    latent_dim=10, cat_dim=20,
    epoch=200,
    rmbatch_module_kwargs={"lambda_reg": 0.1},  # <- Changed here
    batch_effect=["study", "species"],
    path=model_dir
)
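To compare several alignment strengths side by side, the call above can be wrapped in a small loop. This is only a sketch: `sweep_lambda_reg` and the per-value subdirectory naming are hypothetical and not part of Cell BLAST. The fit function is passed in as an argument, so the loop assumes nothing about the library beyond the keyword arguments already shown above.

```python
import os


def sweep_lambda_reg(fit_fn, dataset, lambdas, base_dir, **fit_kwargs):
    """Fit one model per lambda_reg value, each in its own directory.

    fit_fn is expected to accept the same keyword arguments as the
    fit_DIRECTi call shown above (rmbatch_module_kwargs, path, ...).
    This helper is hypothetical and not part of Cell BLAST.
    """
    models = {}
    for lam in lambdas:
        # Separate path per value so results from different runs never mix.
        path = os.path.join(base_dir, f"lambda_reg_{lam}")
        models[lam] = fit_fn(
            dataset,
            rmbatch_module_kwargs={"lambda_reg": lam},
            path=path,
            **fit_kwargs,
        )
    return models
```

With the variables from the snippet above, this might be called as `sweep_lambda_reg(cb.directi.fit_DIRECTi, combined_dataset, [0.01, 0.1, 1.0], model_dir, genes=var_genes_study, latent_dim=10, cat_dim=20, epoch=200, batch_effect=["study", "species"])`.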
Hi @Jeff1995, thanks so much for the quick reply! These are all very helpful tips.
I tried setting lambda_reg to 0.1 and 1 as you suggested, but oddly this seems to have no effect on the results. I don't have a screenshot, but I think these plots are also identical to when I don't specify lambda_reg at all.
I even named the models differently and set reuse=False so that I wasn't accidentally plotting old results.
Do you have an idea what might be happening here?
Thanks, Brian
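One way to confirm that two runs really are identical, rather than just visually similar after UMAP, is to compare the latent coordinates directly. The sketch below is plain Python and assumes the embeddings have already been extracted as lists of per-cell coordinate rows, with cells in the same order. `max_abs_diff` is a hypothetical helper, not a Cell BLAST function, and the numbers are toy data.

```python
def max_abs_diff(emb_a, emb_b):
    """Largest absolute coordinate difference between two embeddings.

    emb_a, emb_b: sequences of rows (one row of floats per cell),
    with cells in the same order in both.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings have different numbers of cells")
    worst = 0.0
    for row_a, row_b in zip(emb_a, emb_b):
        if len(row_a) != len(row_b):
            raise ValueError("embeddings have different dimensions")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(x - y))
    return worst


# Toy data standing in for latent coordinates from different model runs.
run_default = [[0.10, -1.20], [0.55, 0.30]]
run_lambda_1 = [[0.10, -1.20], [0.55, 0.30]]
run_other = [[0.12, -1.20], [0.55, 0.80]]

print(max_abs_diff(run_default, run_lambda_1))  # 0.0 means identical
print(max_abs_diff(run_default, run_other))
```

A difference of exactly 0.0 between runs with different lambda_reg values would suggest that the setting is not reaching the model or that cached results are being loaded. A small non-zero difference would instead suggest that the setting is applied but has little effect on this data.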
Well, that's weird... I have never seen anything like that. Maybe something in this data triggered a bug in the model. Would you mind sharing the dataset you're using so I can have a closer look?
Hi @Jeff1995, sorry for the delay, had some issues finding a way to get my data shareable.
I think this should work now, but let me know if you have any issues. https://drive.google.com/file/d/1hXXnpXiRp7evl727V3onkx0WmThGFFkx/view?usp=sharing
Thanks again, Brian
Hello,
Thanks again for such a great tool!
I'm currently trying to use Cell BLAST to integrate some pretty diverse datasets: several mouse scRNAseq atlases, a zebrafish dataset, and a fly dataset (all of which are mostly from the central nervous system).
I'm struggling to find a tool that's able to handle this amount of diversity, and have had variable success with Cell BLAST. Here are some steps I've taken:
find_variable_genes() after reducing min_group_frac=, since the default 0.5 only returns ~60 genes (which doesn't seem like it would be enough info to integrate the datasets well). I've played around with this parameter and run DIRECTi with anywhere from 60 to 400 to 4,000 to all genes.
visualize_latent() and manually running UMAP).
Is there anything you can see that I might be doing wrong, or do you have any recommendations to improve the integration in this case? I've been finding that most tools have trouble with integrating data from species this divergent, probably in part due to the fact that most genes are 0s for some species. I've also tried using gene intersections, but this only leaves ~400 genes across mouse + zebrafish + fly, which doesn't seem to be enough to differentiate cell-types (and certainly not sub-types).
Thanks so much in advance, Brian
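The gene-intersection trade-off described above can be illustrated with a small sketch. The dataset names and gene symbols below are made up for illustration, and the code assumes genes have already been mapped to a shared ortholog namespace. It is not how Cell BLAST selects genes.

```python
def shared_genes(genes_by_dataset):
    """Genes present in every dataset (strict intersection)."""
    gene_sets = [set(g) for g in genes_by_dataset.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def genes_in_at_least(genes_by_dataset, min_frac):
    """Genes detected in at least min_frac of the datasets."""
    n = len(genes_by_dataset)
    counts = {}
    for genes in genes_by_dataset.values():
        for g in set(genes):
            counts[g] = counts.get(g, 0) + 1
    return {g for g, c in counts.items() if c / n >= min_frac}


# Hypothetical ortholog-mapped gene lists, one per dataset.
genes_by_dataset = {
    "mouse": ["SNAP25", "GFAP", "OLIG2", "MBP"],
    "zebrafish": ["SNAP25", "GFAP", "OLIG2"],
    "fly": ["SNAP25", "REPO"],
}

print(sorted(shared_genes(genes_by_dataset)))
print(sorted(genes_in_at_least(genes_by_dataset, 0.5)))
```

Relaxing the strict intersection keeps more genes, but the extra genes are absent (all zeros) in at least one species, which is the same tension described in the post.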