gao-lab / Cell_BLAST

A BLAST-like toolkit for large-scale scRNA-seq data querying and annotation.
http://cblast.gao-lab.org
MIT License
86 stars 13 forks source link

Interspecies integration #19

Open bschilder opened 3 years ago

bschilder commented 3 years ago

Hello,

Thanks again for such a great tool!

I'm currently trying to use Cell BLAST to integrate some pretty diverse datasets; several mouse scRNAseq atlases, a zebrafish dataset, and fly dataset (all of which are mostly from central nervous system).

I'm struggling to find a tool that's able to handle this amount of diversity, and have had variable success with Cell BLAST. Here's some steps I've taken:

  1. Convert all gene symbols to their human orthologues using homologene.
  2. Discard any genes that do not have 1:1 orthologues in humans. This leaves ~16,000 genes in mouse, ~10,000 genes in zebrafish, and ~1,000 genes in fly.
  3. Select genes using find_variable_genes() after reducing min_group_frac=, since the default 0.5 only returns ~60 genes (which doesn't seem like it would be enough info to integrate the datasets well). I've played around with this parameter and run DIRECTi with anywhere from 60 to 400 to 4,000 to all genes.
  4. Train the DIRECTi model on the raw expression data from all these datasets at once, while controlling for study and species as batch effects."
  5. Reduce the latent factors into 2D UMAP space (tried both using the visualize_latent() and manually running UMAP).
  6. In all cases, the datasets didn't seem to integrate very well, in the sense that I expect neurons to cluster with neurons, oligodendrocytes to cluster with oligodendrocytes (regardless of the species or study they're from). Instead, they just cluster by the dataset they're from (even with the same species).
Screenshot 2021-04-23 at 22 30 55
var_genes_study, axes_study = combined_dataset.find_variable_genes(grouping="study", min_group_frac=.25)

model = cb.directi.fit_DIRECTi(combined_dataset, 
                               genes = var_genes_study, 
                               latent_dim=10, cat_dim=20, 
                               epoch=50,
                               batch_effect = ["study","species"],
                               path = model_dir
)

Is there anything you can see that I might be doing wrong, or do you have any recommendations to improve the integration in this case? I've been finding that most tools have trouble with integrating data from species this divergent, probably in part due to the fact that most genes are 0s for some species. I've also tried using gene intersections, but this only leaves ~400 genes across mouse + zebrafish + fly, which doesn't seem to be enough to differentiate cell-types (and certainly not sub-types).

Thanks so much in advance, Brian

Jeff1995 commented 3 years ago

Thanks for your interest in Cell BLAST and the detailed explanation!

Given the above information, I can think of two potential fixes:

  1. One possibility is that the model has not fully converged. You may try increasing epoch number to larger values like 200 or 500. The model will trigger early stop if loss has converged.

  2. In Cell BLAST we have a single tunable hyperparameter (denoted as lambda_b in the manuscript) that controls the strength of adversarial batch alignment. Larger values enforce stronger alignment, but also increase the risk of over-alignment. By default, we use lambda_b=0.01, which empirically produces good result in most cases.

    In this case, it seems that lambda_b=0.01 is insufficient for aligning datasets this diverse. My recommendation is to try increasing this hyperparameter to values like 0.1 or 1.0. That should produce better dataset mixing.

    The hyperparameter can be set like this:

model = cb.directi.fit_DIRECTi(combined_dataset, 
                               genes = var_genes_study, 
                               latent_dim=10, cat_dim=20, 
                               epoch=200, rmbatch_module_kwargs={"lambda_reg": 0.1},  # <- Changed here
                               batch_effect = ["study","species"],
                               path = model_dir
)
bschilder commented 3 years ago

Hi @Jeff1995 , thanks so much for the quick reply! These are all very helpful tips.

I tried setting lambda_reg to 0.1 and 1 as you suggested, but oddly this seems to have no effect on the results. I don't have a screenshot, but I think these plots are also identical to when I don't specify lambda_reg at all.

I even named the models differently and set reuse=False so that I wasn't accidentally plotting old results.

lambda_reg=0.1

Screenshot 2021-04-28 at 09 36 48

lambda_reg=1

Screenshot 2021-04-28 at 10 23 51

Do you have an idea what might be happening here?

Thanks, Brian

Jeff1995 commented 3 years ago

Well, that's weird... I have never seen anything like that. Maybe it's something in this data that triggered a bug in the model. Would you mind if you share the dataset you're using so I can have a closer look?

bschilder commented 3 years ago

Hi @Jeff1995 , sorry for the delay, had some issues finding a way to get my data shareable.

I think this should work now, but let me know if you have any issues. https://drive.google.com/file/d/1hXXnpXiRp7evl727V3onkx0WmThGFFkx/view?usp=sharing

Thanks again, Brian