Reference atlases with counts data

mihem commented 2 years ago

Thanks again for this great package. I think it is a real enrichment for the single-cell RNA-seq community, and it helped me better understand T cell clusters.

@mass-a Would it be possible to provide the full data for the reference atlases, i.e. including count data, variable genes, etc. (e.g. via figshare)?

The background is that integrating larger datasets via Seurat causes memory issues (with 64 GB RAM available) and takes a really long time. Additionally, in several datasets integration fails with the error: Error in idx[i, ] <- res[[i]][[1]]: number of items to replace is not a multiple of replacement length. I read issue 16, but my dataset contains mostly T cells (between ~60-90% based on the filtering).

full log of make.projection with a list with 5 T cell datasets and ref_TIL

```
[1] "Using assay SCT for N_COVID_CSF"
[1] "196 out of 1524 ( 13 % ) non-pure T cells removed. Use filter.cells=FALSE to avoid pre-filtering (NOT RECOMMENDED)"
[1] "Transforming expression matrix into space of mouse orthologs"
[1] "Aligning N_COVID_CSF to reference map for batch-correction..."
[1] "DIRECTLY projecting query onto Reference PCA space"
[1] "DIRECTLY projecting query onto Reference UMAP space"
[1] "Using assay SCT for PostCOVID_blood"
[1] "1234 out of 3667 ( 34 % ) non-pure T cells removed. Use filter.cells=FALSE to avoid pre-filtering (NOT RECOMMENDED)"
[1] "Transforming expression matrix into space of mouse orthologs"
[1] "Aligning PostCOVID_blood to reference map for batch-correction..."
[1] "DIRECTLY projecting query onto Reference PCA space"
[1] "DIRECTLY projecting query onto Reference UMAP space"
Pre-filtering of T cells (TILPRED classifier)...
Genes in the gene sets NOT available in the dataset:
  B.cell: 6 (12% of 50)
  CAF: 12 (24% of 49)
  Endo.: 11 (24% of 46)
  Macrophage: 5 (11% of 46)
  Mal: 8 (18% of 45)
Computing within dataset neighborhoods
Finding all pairwise anchors
Projecting new data onto SVD
Projecting new data onto SVD
Finding neighborhoods
Finding anchors
Found 144 anchors
Alignment failed due to: Error in idx[i, ] <- res[[i]][[1]]: number of items to replace is not a multiple of replacement length
Warning: alignment of query dataset failed - Trying direct projection...
Pre-filtering of T cells (TILPRED classifier)...
Genes in the gene sets NOT available in the dataset:
  B.cell: 6 (12% of 50)
  CAF: 12 (24% of 49)
  Endo.: 11 (24% of 46)
  Macrophage: 5 (11% of 46)
  Mal: 8 (18% of 45)
Computing within dataset neighborhoods
Finding all pairwise anchors
Projecting new data onto SVD
Projecting new data onto SVD
Finding neighborhoods
Finding anchors
Found 124 anchors
Alignment failed due to: Error in idx[i, ] <- res[[i]][[1]]: number of items to replace is not a multiple of replacement length
Warning: alignment of query dataset failed - Trying direct projection...
[1] "Using assay SCT for PostCOVID_CSF"
[1] "1497 out of 5878 ( 25 % ) non-pure T cells removed. Use filter.cells=FALSE to avoid pre-filtering (NOT RECOMMENDED)"
[1] "Transforming expression matrix into space of mouse orthologs"
[1] "Aligning PostCOVID_CSF to reference map for batch-correction..."
[1] "DIRECTLY projecting query onto Reference PCA space"
[1] "DIRECTLY projecting query onto Reference UMAP space"
[1] "Using assay SCT for IIH_CSF"
[1] "1159 out of 4887 ( 24 % ) non-pure T cells removed. Use filter.cells=FALSE to avoid pre-filtering (NOT RECOMMENDED)"
[1] "Transforming expression matrix into space of mouse orthologs"
[1] "Aligning IIH_CSF to reference map for batch-correction..."
Projecting corrected query onto Reference PCA space
[1] "Projecting corrected query onto Reference UMAP space"
[1] "Using assay SCT for IIH_blood"
[1] "1750 out of 8326 ( 21 % ) non-pure T cells removed. Use filter.cells=FALSE to avoid pre-filtering (NOT RECOMMENDED)"
[1] "Transforming expression matrix into space of mouse orthologs"
[1] "Aligning IIH_blood to reference map for batch-correction..."
Projecting corrected query onto Reference PCA space
[1] "Projecting corrected query onto Reference UMAP space"
Pre-filtering of T cells (TILPRED classifier)...
Genes in the gene sets NOT available in the dataset:
  B.cell: 6 (12% of 50)
  CAF: 12 (24% of 49)
  Endo.: 11 (24% of 46)
  Macrophage: 5 (11% of 46)
  Mal: 8 (18% of 45)
Computing within dataset neighborhoods
Finding all pairwise anchors
Projecting new data onto SVD
Projecting new data onto SVD
Finding neighborhoods
Finding anchors
Found 118 anchors
Alignment failed due to: Error in idx[i, ] <- res[[i]][[1]]: number of items to replace is not a multiple of replacement length
Warning: alignment of query dataset failed - Trying direct projection...
Pre-filtering of T cells (TILPRED classifier)...
Genes in the gene sets NOT available in the dataset:
  B.cell: 6 (12% of 50)
  CAF: 12 (24% of 49)
  Endo.: 11 (24% of 46)
  Macrophage: 5 (11% of 46)
  Mal: 8 (18% of 45)
Computing within dataset neighborhoods
Finding all pairwise anchors
Projecting new data onto SVD
Projecting new data onto SVD
Finding neighborhoods
Finding anchors
Found 129 anchors
Pre-filtering of T cells (TILPRED classifier)...
Genes in the gene sets NOT available in the dataset:
  B.cell: 6 (12% of 50)
  CAF: 12 (24% of 49)
  Endo.: 11 (24% of 46)
  Macrophage: 5 (11% of 46)
  Mal: 8 (18% of 45)
Computing within dataset neighborhoods
Finding all pairwise anchors
Projecting new data onto SVD
Projecting new data onto SVD
Finding neighborhoods
Finding anchors
Found 150 anchors
```

sessionInfo

[1] future.apply_1.8.1 future_1.22.1 ProjecTILs_1.0.0 Matrix_1.3-4
[5] TILPRED_1.0.2 umap_0.2.7.0 SeuratObject_4.0.2 Seurat_4.0.4
[9] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[13] readr_2.0.2 tidyr_1.1.4 tibble_3.1.5 ggplot2_3.3.5
[17] tidyverse_1.3.1

Therefore, I would like to use Harmony or Symphony, which are much faster, need much less memory, and are, in my experience, as good as or even better than Seurat integration (although I am a huge Seurat fan).

Maybe you could also consider using Harmony or Symphony in the package, because I think your visualizations are unique. That would probably mean completely rewriting the package, though, so it would be great if you could provide the full atlases so that users can apply other algorithms that need the count data and variable genes to these references.

Thank you, Mischko

EDIT: When integrating selected CD8 clusters from different conditions (between 400-900 cells), both with and without the filter, into the beautiful ref_LCMV_Atlas, they all fail with the same error: Error in idx[i, ] <- res[[i]][[1]]: number of items to replace is not a multiple of replacement length. I lowered seurat.k.filter to 20 without any effect.

mass-a commented 2 years ago

Hello Mischko,

Here's a link to the TIL reference object containing un-normalized counts (in the @assays$RNA@counts slot): ref_TILAtlas_mouse_wcounts.rds

Please note that one of the datasets that compose the atlas ("Singer" in the $Study metadata field) was generated with smart-seq2 technology and directly reported normalized TPM counts. We tried to emulate raw counts for this study with an inverse transformation [2^(m)-1] on the matrix elements. Otherwise all other studies used 10x and we started from their reported raw counts, normalizing the data with a standard log1p transformation.
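As a rough illustration of the inverse transformation mentioned above, here is a minimal sketch in plain Python. It assumes the reported values have the form m = log2(x + 1), which is inferred from the 2^(m)-1 formula rather than stated in the thread; the function names and example values are hypothetical.

```python
import math

def log2p1(x):
    """Forward transform assumed for the reported values: m = log2(x + 1)."""
    return math.log2(x + 1.0)

def inverse_log2p1(m):
    """Inverse transform from the comment above: x = 2^m - 1."""
    return 2.0 ** m - 1.0

# Hypothetical TPM-like values for one gene across four cells.
tpm = [0.0, 1.0, 7.0, 1023.0]

reported = [log2p1(x) for x in tpm]               # what a study would publish
emulated = [inverse_log2p1(m) for m in reported]  # the emulated "counts"

print(reported)
print(emulated)
```

Note that the recovered values are on the TPM scale and are generally not integers, so they only approximate what raw counts would look like.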

It is indeed puzzling that Seurat integration fails - I can see you have thousands of cells, but only 100-200 anchors are found for integration. Have you tried using a simple log-normalization of the data instead of SCT? That way, both reference and query would use the same normalization method.

Best -m

mihem commented 2 years ago

Hi Massimo,

Thank you, very kind. Did you then integrate those data by "Study" or by "Sample" in the meta.data? Could you also provide the same for the viral CD4 and CD8 reference atlases?

Yes, very good point. I have tried simple log-normalization as you suggested. For the TIL reference atlas that worked well, but for the viral CD4 and CD8 datasets Seurat integration failed again. Do you have any other ideas?

Thanks, Mischko

mass-a commented 2 years ago

For the TIL atlas, the datasets were integrated by "Study" (according to the metadata field in the reference).

For the viral atlases, since we applied a simple log-transformation to each sample (see the NormalizeData function in Seurat), you should be able to obtain the raw counts by the inverse transformation. Otherwise, all data are publicly available as count matrices in GEO, if you want to process them differently – see the links to the data in the methods of the two publications: CD8 viral atlas – CD4 viral atlas.
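To make the inverse transformation concrete, here is a minimal sketch in plain Python. It assumes the log-normalization has the form log1p(count / total * scale_factor) with Seurat's default scale factor of 10,000, and that the per-cell total is known; the function names and the example counts are hypothetical.

```python
import math

SCALE_FACTOR = 10_000  # assumption: Seurat's default scale.factor

def log_normalize(counts, scale_factor=SCALE_FACTOR):
    """Per-cell log-normalization: log1p(count / total * scale_factor)."""
    total = sum(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]

def invert_log_normalize(values, total, scale_factor=SCALE_FACTOR):
    """Undo the transform for one cell, given that cell's total counts."""
    return [round(math.expm1(v) * total / scale_factor) for v in values]

# Hypothetical raw counts for one cell (five genes).
counts = [0, 3, 1, 12, 4]

normalized = log_normalize(counts)
recovered = invert_log_normalize(normalized, total=sum(counts))

print(recovered)
```

The round trip only works because `total` is passed back in, so the per-cell totals have to be available from somewhere.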

Cheers! -m

mihem commented 2 years ago

Thank you!

mihem commented 2 years ago

Just as a side note: Seurat integration of my dataset no longer fails after updating to ProjecTILs v2.0. I am a little confused, because I thought the problem was within the Seurat package.

alexvpickering commented 2 years ago

> For the TIL atlas, the datasets were integrated by "Study" (according to the metadata field in the reference).
>
> For the viral atlases, since we applied a simple log-transformation to each sample (see the NormalizeData function in Seurat), you should be able to obtain the raw counts by the inverse transformation. Otherwise, all data are publicly available as count matrices in GEO, if you want to process them differently – see the links to the data in the methods of the two publications: CD8 viral atlas – CD4 viral atlas.
>
> Cheers! -m

I don't believe the raw counts can be recovered from the output of the NormalizeData function, because counts are first divided by the total counts for the cell. You would also need to know what those total counts were in order to recover the raw counts, and as far as I know Seurat objects do not store them.
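The point about total counts can be illustrated with a small sketch in plain Python, assuming the normalization is log1p(count / total * 10000). Two hypothetical cells with the same proportions but different depths give identical normalized values. The `guess_counts` helper is a heuristic introduced only for illustration, not something from Seurat or ProjecTILs: it assumes the smallest non-zero value in a cell came from a count of 1.

```python
import math

SCALE_FACTOR = 10_000  # assumption: Seurat's default scale.factor

def log_normalize(counts, scale_factor=SCALE_FACTOR):
    """Per-cell log-normalization: log1p(count / total * scale_factor)."""
    total = sum(counts)
    return [math.log1p(c / total * scale_factor) for c in counts]

def guess_counts(values):
    """Heuristic: assume the smallest non-zero value came from a count of 1."""
    scaled = [math.expm1(v) for v in values]  # proportional to the raw counts
    unit = min(s for s in scaled if s > 0)
    return [round(s / unit) for s in scaled]

cell_a = [1, 2, 7, 0]
cell_b = [2, 4, 14, 0]  # same proportions as cell_a, twice the depth

norm_a = log_normalize(cell_a)
norm_b = log_normalize(cell_b)

# The normalized profiles are indistinguishable without the totals.
same = all(math.isclose(x, y) for x, y in zip(norm_a, norm_b))
print(same)

print(guess_counts(norm_a))  # matches cell_a, whose smallest count is 1
print(guess_counts(norm_b))  # does not match cell_b, whose smallest count is 2
```

The heuristic is right whenever a cell contains at least one gene with a count of exactly 1, which is common in sparse droplet data but not guaranteed. If a per-cell total was kept alongside the data (for example in a metadata column), using it directly is the safer route.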

carmonalab / ProjecTILs

Reference atlases with counts data #23