digitalcytometry / cytospace

CytoSPACE: Optimal mapping of scRNA-seq data to spatial transcriptomics data

Integration error #94

Closed Ocean-Lyu closed 11 months ago

Ocean-Lyu commented 11 months ago

Hi, thank you for developing such an exciting tool! I am trying CytoSPACE on my 10x Genomics snRNA-seq and 10x Visium data, which I converted to CytoSPACE input files with your script "generate_cytospace_from_seurat_object.R". However, I encountered the following error:

2023-11-09 21:50:07.627569 Integration
Normalizing query using reference SCT model
Performing PCA on the provided reference using 3000 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
Found 56 anchors
Finding integration vectors
Finding integration vector weights
Error in idx[i, ] <- res[[i]][[1]] :
  number of items to replace is not a multiple of replacement length
Calls: get_cellfracs_seuratv3 ... FindWeights -> NNHelper -> do.call -> AnnoyNN -> AnnoySearch

And the log file:

CytoSPACE log file

Start time: Thu Nov 9 21:43:33 2023

INPUT ARGUMENTS
scRNA_path: ./sham/shamscRNA_data.txt
cell_type_path: ./sham/shamcell_type_labels.txt
st_path: ./sham/shamST_data.txt
coordinates_path: ./sham/shamCoordinates.txt
n_cells_per_spot_path: None
cell_type_fraction_estimation_path: None
st_cell_type_path: None
output_folder: ./sham_res
mean_cell_numbers: 5
downsample_off: False
scRNA_max_transcripts_per_cell: 1500
plot_off: False
geometry: honeycomb
output_prefix:
seed: 1
solver_method: lapjv_compat
sampling_method: duplicates
distance_metric: Pearson_correlation
single_cell: False
sampling_sub_spots: False

The input files look fine, so I suspect the problem occurs when CytoSPACE attempts to integrate the data.

erinlbrown commented 11 months ago

Hi, thank you for your interest in CytoSPACE!

By default, CytoSPACE relies on the Seurat integration pipeline to estimate cell type abundance in the ST sample. When there are few cells per type in your input single cell data set, there can be errors in this pipeline. You could try re-running CytoSPACE with the flag --downsample-off, which will turn off downsampling for the single cell data, but more likely you have few cells of one or more cell types in your input file.
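One way to check for rare cell types before re-running is to count the cells per label in the cell type labels file. The sketch below is only an illustration: it assumes a tab-delimited file with a header row and the cell type in the second column, and the `min_cells` threshold is a hypothetical cutoff, not a value taken from CytoSPACE or Seurat.

```python
import csv
import io
from collections import Counter


def count_cells_per_type(handle, min_cells=50):
    """Count cells per cell type and report the types below min_cells.

    Assumes a tab-delimited table with one header row and the cell type
    label in the second column (an assumption about the file layout).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row if present
    counts = Counter(row[1] for row in reader if len(row) >= 2)
    rare = {cell_type: n for cell_type, n in counts.items() if n < min_cells}
    return counts, rare


# Toy input standing in for a real labels file (made-up cells and labels).
example = io.StringIO(
    "Cell IDs\tCellType\n"
    "cell_1\tNeuron\n"
    "cell_2\tNeuron\n"
    "cell_3\tAstrocyte\n"
    "cell_4\tNeuron\n"
)
counts, rare = count_cells_per_type(example, min_cells=2)
print(dict(counts))
print(rare)
```

For a real run, the same function could be called on an open file handle, for example one opened on the `cell_type_path` shown in the log above.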

If that is the case, there are a few options:

Hope that helps!

Erin

Ocean-Lyu commented 11 months ago

Thanks for your reply! I checked my snRNA-seq data and did find a cluster consisting of only dozens of cells. Then I tried both options you offered. First, I removed this cluster, and the pipeline finished successfully. However, because I consider this cluster biologically meaningful, I also modified the script "get_cellfracs_seuratv3.R" and lowered the parameter k.weight in "TransferData()".

The result looks like this: "predictions.assay <- TransferData(anchorset = anchors, refdata = cell_index_vec, k.weight = 35)".

Both of them worked, and some cell types were mapped beautifully. However, the resulting cell mapping can differ slightly depending on the value of k.weight, for example k.weight = 35 versus 30. Moreover, when I compared the results from these two methods, I was confused to find that some cell types that did not show up when lowering k.weight were mapped when I removed the cluster with few cells.

Lastly, not all the cell types were mapped, even though my snRNA-seq library came from the tissue remaining after the Visium library was prepared. I guess that is because of sampling deviations related to section thickness, which limit the cell types captured on a single Visium section?

erinlbrown commented 11 months ago

I’m glad to hear that worked, at least to run the abundance estimation pipeline without error!

CytoSPACE mappings will be subtly different given variation in the cell type abundance estimates provided, but for reasonably small variation in abundance estimates, we expect CytoSPACE mappings to be of comparable quality. However, completely absent cell types can certainly pose a problem, either for downstream analyses or for the overall mapping if the cell types are expected to be present in significant abundance.
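To get a rough sense of how much two abundance estimates differ, one could compare them directly before mapping. The sketch below is an assumption-laden illustration, not part of CytoSPACE: it uses the total variation distance as the summary metric (my choice, not something the pipeline reports) and lists cell types that are estimated at zero in exactly one of the two runs. The fractions are made-up toy values.

```python
def compare_fractions(est_a, est_b):
    """Compare two {cell_type: fraction} estimates.

    Returns the total variation distance (half the sum of absolute
    differences) and the cell types that are zero or missing in exactly
    one of the two estimates.
    """
    cell_types = sorted(set(est_a) | set(est_b))
    tvd = 0.5 * sum(
        abs(est_a.get(t, 0.0) - est_b.get(t, 0.0)) for t in cell_types
    )
    absent_in_one = [
        t
        for t in cell_types
        if (est_a.get(t, 0.0) == 0.0) != (est_b.get(t, 0.0) == 0.0)
    ]
    return tvd, absent_in_one


# Toy estimates from two hypothetical runs (illustrative values only).
est_a = {"A": 0.5, "B": 0.3, "C": 0.2}
est_b = {"A": 0.5, "B": 0.5, "C": 0.0}
tvd, absent_in_one = compare_fractions(est_a, est_b)
print(round(tvd, 3), absent_in_one)
```

A small distance with no newly absent cell types would correspond to the "reasonably small variation" case described above, while any entry in `absent_in_one` flags a cell type worth a closer look.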

Since your snRNA-seq and Visium come from the same tissue, it seems reasonable to expect that cell types present in the snRNA-seq will be represented in the Visium as well. Of course, if this is a tissue sample with distinct regional morphology and the snRNA-seq section includes regions not represented in the Visium sample or extremely rare cell types, that may not be the case.

Without knowing more about your data, I cannot give any definitive recommendations, but I have a few suggestions you could consider, assuming that the zero abundance estimated for these clusters is not expected biologically.

That removing the cluster with few cells resulted in other clusters being mapped with non-zero abundance estimates suggests that you may have either (1) a large number of clusters or (2) clusters which are fairly similar in expression, particularly to the cluster with few cells. For either case, many abundance estimation pipelines can run into issues. One approach that may work would be to combine similar smaller clusters into larger groups as appropriate (e.g., combine subsets of CD4+ T cells into a CD4+ T cell group) for the abundance estimation step, assuming that within each group the relative abundance of each smaller cluster is preserved between snRNA-seq and ST data. Often this can be a reasonable assumption for paired samples even for scRNA-seq as many of the effects of cell dissociation impact broader lineages differentially more than closely-related sublineages. I would assume this would hold just as well or better for snRNA-seq. That being said, this is fundamentally a biological consideration, so of course the specific details of your tissue and data will matter.
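The grouping step described above can be sketched as a simple relabeling. This is a minimal illustration, not CytoSPACE code: the `group_of` mapping, the cell IDs, and the label names are hypothetical, and which clusters belong together remains a biological decision.

```python
def merge_labels(cell_labels, group_of):
    """Replace fine cluster labels with broader group labels.

    cell_labels: {cell_id: fine_label}
    group_of:    {fine_label: broad_label}; labels not listed are kept as-is.
    """
    return {
        cell_id: group_of.get(label, label)
        for cell_id, label in cell_labels.items()
    }


# Hypothetical labels: two CD4+ T cell subsets merged into one group.
cell_labels = {"c1": "CD4 naive", "c2": "CD4 memory", "c3": "B cell"}
group_of = {"CD4 naive": "CD4 T cell", "CD4 memory": "CD4 T cell"}
merged = merge_labels(cell_labels, group_of)
print(merged)
```

The merged labels would then be used only for the abundance estimation step, with the original fine labels kept aside for downstream annotation.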

If you do this, you can let CytoSPACE map according to these broader cell groups and then simply add the full annotation for downstream analyses. CytoSPACE samples cells at random by input cell type, so it is likely that the fractions actually mapped will match the fractions of the clusters within the broader groups. If you want to absolutely ensure that these fractions are preserved, you can convert the output of the cell fraction estimation step over larger groups to the full cluster-level values and then pass that file in directly.
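Converting group-level fractions back to cluster-level values amounts to splitting each group's fraction in proportion to the cluster sizes in the single-cell data: fraction(cluster) = fraction(group) × n(cluster) / n(group). The sketch below illustrates that arithmetic under the assumption stated above, that within-group proportions are preserved between the snRNA-seq and ST data. All names and numbers are hypothetical, and writing the result in the file format CytoSPACE expects is not covered here.

```python
from collections import Counter


def expand_group_fractions(group_fractions, fine_labels, group_of):
    """Split group-level fractions across the fine clusters in each group.

    group_fractions: {broad_label: estimated fraction in the ST sample}
    fine_labels:     one fine label per single cell
    group_of:        {fine_label: broad_label}; unlisted labels map to themselves
    """
    fine_counts = Counter(fine_labels)
    group_counts = Counter(group_of.get(label, label) for label in fine_labels)
    expanded = {}
    for label, n_cells in fine_counts.items():
        group = group_of.get(label, label)
        share = n_cells / group_counts[group]  # within-group proportion
        expanded[label] = group_fractions.get(group, 0.0) * share
    return expanded


# Toy example: 3 naive + 1 memory CD4 cells, and 4 B cells.
fine_labels = ["CD4 naive"] * 3 + ["CD4 memory"] + ["B cell"] * 4
group_of = {"CD4 naive": "CD4 T cell", "CD4 memory": "CD4 T cell"}
group_fractions = {"CD4 T cell": 0.4, "B cell": 0.6}
expanded = expand_group_fractions(group_fractions, fine_labels, group_of)
print(expanded)
```

In this toy case the CD4 T cell fraction of 0.4 is split roughly 0.3 and 0.1 between the naive and memory subsets, and the expanded fractions still sum to the original total.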

Finally, if the above does not seem to fit your situation well, I would also recommend confirming that you are using the Seurat version (v3) included with CytoSPACE dependencies, as there are some slight differences in behavior between the versions.