Lung cancer data - Githubissues

SirKuikka commented 1 year ago

Hi,

Is it a good idea to use scArches (https://github.com/theislab/scarches/blob/hlca_tutorial_improvements/notebooks/hlca_map_classify.ipynb) to map lung cancer data to the HLCA? The cancer cells are probably quite different from healthy epithelial cells. Could this analysis reveal something interesting about the query (cancer) data? Or was this scArches application mostly designed for healthy lung scRNA-seq data?

LisaSikkema commented 1 year ago

Hi @SirKuikka, as you can read in the paper, we mapped lots of disease data to the healthy HLCA core, so it should work. The quality of the mapping mostly depends on how different your data is from the HLCA core (which is single-cell 10X) in terms of technologies used etc. For example, single-nucleus data might not give as nice a mapping as single-cell data.

Ideally you also have some healthy controls in your dataset, so that you can use those to check if the mapping went well (i.e. if the controls from your data are mixing well with the HLCA core, which contains controls only). But I'd say just give it a try, and if the mapping works well it could indeed tell you something about which cell types are different in your disease of interest (lung cancer in this case) compared to healthy tissue. You can check out the paper for examples of that; we do it with lung cancer, IPF and more. All notebooks with analysis are available on the HLCA reproducibility GitHub.
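One way to put a number on "mixing well" is to ask, for each of your control cells, what fraction of its nearest neighbours in the shared latent space are reference cells. The snippet below is only a minimal sketch of that idea on made-up 2D coordinates, in plain Python. The function name `reference_neighbor_fraction`, the toy points and the choice of `k` are all assumptions for illustration; this is not taken from the HLCA notebooks or from scArches.

```python
import math


def reference_neighbor_fraction(query, reference, k=3):
    """For each query cell, return the fraction of its k nearest neighbours
    (among all other cells, reference and query pooled) that are reference cells."""
    pooled = [(vec, True) for vec in reference] + [(vec, False) for vec in query]
    fractions = []
    for qi, q in enumerate(query):
        self_index = len(reference) + qi  # position of this query cell in the pool
        dists = [
            (math.dist(q, vec), is_ref)
            for idx, (vec, is_ref) in enumerate(pooled)
            if idx != self_index  # a cell is not its own neighbour
        ]
        dists.sort(key=lambda pair: pair[0])
        fractions.append(sum(is_ref for _, is_ref in dists[:k]) / k)
    return fractions


# Toy latent coordinates (hypothetical): two reference "cell types".
reference = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
]
# Controls that land inside the reference clusters.
mixed_controls = [(0.05, 0.05), (1.05, 1.05)]
# Cells that form their own cluster away from the reference.
separate_cells = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]

mixed_fractions = reference_neighbor_fraction(mixed_controls, reference)
separate_fractions = reference_neighbor_fraction(separate_cells, reference)
print(mixed_fractions, separate_fractions)
```

Fractions near 1 for your healthy controls would suggest they integrate with the reference, while values near 0 mean the cells mostly neighbour each other. With real data the reference is far larger than the query, so the expected fraction under good mixing depends on that size ratio and on `k`.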

SirKuikka commented 1 year ago

Hi @LisaSikkema

I have only lung cancer cells, and the cells formed a cluster that is distinct from the reference cells when I used scArches to map the cells to the HLCA reference. The cells were close to epithelial cells, which makes sense.

I guess it's a nice visualization to show that our cells are different from normal lung cells.

Besides that, I have to think about what else the latent features could tell me about my data. Would these latent features be in some sense better than the ones you get from e.g. Seurat's PCA? Do you think there would be some reason to use the scArches latents instead of principal components?

LisaSikkema commented 1 year ago

That sounds exactly like what we found when mapping lung cancer to the HLCA, just check out fig. 4c and Extended Data figure 6 of our paper, plus the text with the figures.

The advantages of using a reference instead of just your own data are multiple:

- you get label transfer, i.e. a fast first draft annotation of your data, included. This normally takes a lot of time. The annotations of the HLCA are based on consensus between 6 independent experts.
- you have a more complete "control": most individual datasets just include a few controls and do not cover the entire spectrum of what healthy controls could look like, which can result in mistaken conclusions about differences between healthy and disease (as for the lung cancer dataset we analyse in our paper). The HLCA includes >100 healthy controls.
- the latent space of the HLCA reference is trained on a much wider variety of data and cell types than any single dataset, and is therefore likely better at pulling apart distinct cell types than an individual dataset would be. For example, we were able to detect rare cell types in individual datasets using the HLCA reference that were not originally found/annotated in the datasets. This might not hold true for all cell types though; we were not so good at distinguishing detailed T cell subtypes in the HLCA.
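To make the label transfer point concrete, here is a rough sketch of the general idea: each query cell takes the majority label of its nearest reference cells in the latent space, and the share of disagreeing neighbours serves as an uncertainty score. This is a generic k-nearest-neighbour vote on toy 2D points in plain Python, not necessarily how the HLCA or scArches implement it. The function name `transfer_labels`, the labels and the coordinates are hypothetical.

```python
import math
from collections import Counter


def transfer_labels(query, reference, reference_labels, k=3):
    """Return (label, uncertainty) per query cell from a majority vote
    over its k nearest reference cells."""
    results = []
    for q in query:
        # Indices of reference cells, sorted by distance to this query cell.
        order = sorted(range(len(reference)), key=lambda i: math.dist(q, reference[i]))
        votes = Counter(reference_labels[i] for i in order[:k])
        label, count = votes.most_common(1)[0]
        # Uncertainty = fraction of the k neighbours that disagree with the winner.
        results.append((label, 1.0 - count / k))
    return results


# Toy reference (hypothetical): two annotated clusters.
reference = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
]
reference_labels = ["epithelial"] * 3 + ["immune"] * 3

# One query cell inside a cluster, one lying between the two clusters.
query = [(0.05, 0.05), (0.6, 0.45)]
transferred = transfer_labels(query, reference, reference_labels)
print(transferred)
```

A cell that sits between annotated populations, as distinct cancer cells might, gets a higher uncertainty, so a threshold on that score is one way to flag cells whose transferred label should not be trusted.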

If you're interested in more details I would refer you to the paper, it discusses your question extensively!

LisaSikkema commented 1 year ago

@SirKuikka I will close this issue now but feel free to re-open if you have further comments

LungCellAtlas / HLCA

Lung cancer data #7