YMa-lab / CARD

GNU General Public License v3.0

Simulation dataset #53

Closed cristalliao closed 8 months ago

cristalliao commented 1 year ago

Hi, thanks for this great package. I have a question about the simulation part of the paper. When simulating from a single-cell RNA sequencing (scRNA-seq) dataset, I'm considering combining it with another dataset that provides predefined layer labels and location data, like the dataset shown in the following figure:

[Screenshot: Screen Shot 2023-08-15 at 10 17 50 pm]

I guess that the scRNA-seq dataset and the predefined layer label data and location data are not inherently related. Hence, I'm interested in exploring whether it's feasible to generate simulated data by combining these unrelated datasets.

So in my simulation, I use the locations from the human heart dataset and the scRNA-seq data from the simulation example. Does that make sense?

I'm curious if the simulated data resulting from this approach would still hold validity and coherence. Your insights on this matter would be greatly appreciated. Thanks a lot!

Best regards, Cristal

YingMa0107 commented 1 year ago

Hi @cristalliao,

Thank you for your question.

Technically, the scRNA-seq data, the predefined layer labels, and the location data have no inherent relationship in a common deconvolution setting. You can generate your own location data and then assign each location a simulated layer label. Subsequently, you can use the scRNA-seq reference to simulate data for each layer. However, from an interpretation perspective, it's important to consider how well your simulated data reflects the true human heart dataset. Moreover, the relevance of your simulation also hinges on the specific analysis you're aiming to perform. For example, if your algorithm/method relies on heart tissue shape, its performance on simulated data that doesn't follow the natural heart tissue arrangement might not predict its performance on actual heart tissue.
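The three steps described above (generate locations, assign layer labels, simulate each location from the scRNA-seq reference) could be sketched in base R as follows. This is only an illustrative sketch, not CARD's actual simulation code: the toy reference, the grid, the layer rule, the `dominant` mapping, and the `simulate_spot` helper with its `n_cells` and `p_dominant` parameters are all assumptions made up for the example.

```r
set.seed(1)

# Hypothetical toy reference: 3 genes x 6 cells, two cell types.
ref_counts <- matrix(
  rpois(18, lambda = 5), nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6))
)
ref_type <- rep(c("typeA", "typeB"), each = 3)

# Step 1: generate locations (a 4 x 4 grid here).
loc <- expand.grid(x = 1:4, y = 1:4)

# Step 2: assign each location a simulated layer label (assumed rule).
loc$layer <- ifelse(loc$x <= 2, "layer1", "layer2")

# Assumption: each layer has one dominant cell type.
dominant <- c(layer1 = "typeA", layer2 = "typeB")

# Step 3: for one location, draw cell types with the layer's dominant
# type favoured, pick a reference cell of each drawn type, and sum counts.
simulate_spot <- function(layer, n_cells = 10, p_dominant = 0.8) {
  types <- unique(ref_type)
  prob <- ifelse(types == dominant[[layer]], p_dominant,
                 (1 - p_dominant) / (length(types) - 1))
  drawn_types <- sample(types, n_cells, replace = TRUE, prob = prob)
  cells <- vapply(drawn_types, function(tp) {
    idx <- which(ref_type == tp)
    idx[sample.int(length(idx), 1)]
  }, integer(1))
  rowSums(ref_counts[, cells, drop = FALSE])
}

# Simulated genes x locations count matrix.
sim_counts <- sapply(loc$layer, simulate_spot)
colnames(sim_counts) <- paste0(loc$x, "x", loc$y)
```

Swapping in real coordinates (for example from the heart slices) only changes steps 1 and 2. Whether the result is meaningful still depends on how closely the layer pattern and reference resemble the real tissue.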

A potential follow-up question is why you're not using available human heart scRNA-seq reference data, given their abundance. Using alternative scRNA-seq data is reasonable if a heart reference is unavailable, but not using a matching reference when one exists might raise questions. For a credible simulation, the goal is to closely replicate the characteristics of the real dataset.

I hope this helps! I'm open to further discussion if you have a differing perspective.

Best, Ying

cristalliao commented 1 year ago

Hi Professor Ying,

Thanks for your explanation! It is really helpful. I also think it is better to use the human heart reference dataset for the simulation.

The problem is that I'm having trouble creating the ExpressionSet for the human heart dataset, so I can't run simulations based on it. Do you have any suggestions on how to create an ExpressionSet? Currently, I have two datasets: the first, gene_expression_cell_ref, contains cell names and gene expression data, while the second, celltype_ref_dataset_clean, contains cell names, sample IDs, and cell types. Is this information sufficient to create an ExpressionSet? I have no experience in creating an ExpressionSet for simulation.

[Screenshot: Screen Shot 2023-08-16 at 11 15 42 am]
[Screenshot: Screen Shot 2023-08-16 at 11 13 22 am]

Thanks in advance!

Best regards, Cristal

cristalliao commented 1 year ago

Hi Professor Ying,

I tried to create the ExpressionSet using the code below:

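library(Biobase)  # provides the ExpressionSet and AnnotatedDataFrame classes
library(dplyr)    # provides %>% and mutate_all
gene_expression_cell_ref<- read.csv("/dski/nobackup/xiaoyinl/Clustering_results/dataset_slices_aligned/gene_expression_cell_ref.csv")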
colnames(gene_expression_cell_ref)[colnames(gene_expression_cell_ref) == "X"] <- "cellname"
[Screenshot: Screen Shot 2023-08-16 at 1 05 36 pm]
celltype_ref_dataset<- read.csv("/dski/nobackup/xiaoyinl/Clustering_results/dataset_slices_aligned/celltype_ref_dataset.csv")
celltype_ref_dataset$celltype <- trimws(celltype_ref_dataset$celltype, which = "right")
colnames(celltype_ref_dataset)[colnames(celltype_ref_dataset) == "experiment"] <- "Exp"
celltype_ref_dataset$Exp <- as.integer(sub(".*Exp_(\\d+).*", "\\1", celltype_ref_dataset$Exp))
colnames(celltype_ref_dataset)[colnames(celltype_ref_dataset) == "Exp"] <- "sampleID"
colnames(celltype_ref_dataset)[colnames(celltype_ref_dataset) == "X"] <- "cellname"
colnames(celltype_ref_dataset)[colnames(celltype_ref_dataset) == "celltype"] <- "cellType"
celltype_ref_dataset_clean <- celltype_ref_dataset[, c("cellname", "cellType", "sampleID")]
rownames(celltype_ref_dataset_clean) <-celltype_ref_dataset_clean$cellname
celltype_ref_dataset_clean
[Screenshot: Screen Shot 2023-08-16 at 1 03 39 pm]
# 1. Create an expression matrix
gene_expression_cell_ref_transpose <- t(gene_expression_cell_ref)
gene_expression_cell_ref_transpose_df <- as.data.frame(gene_expression_cell_ref_transpose)
colnames(gene_expression_cell_ref_transpose_df) <- gene_expression_cell_ref_transpose_df[1, ]
gene_expression_cell_ref_transpose_df <- gene_expression_cell_ref_transpose_df[-1, ]

gene_expression_cell_ref_transpose_df <- gene_expression_cell_ref_transpose_df %>% mutate_all(as.numeric)
exprs_matrix <- as.matrix(gene_expression_cell_ref_transpose_df)
# 2. Create phenoData object
pheno_data_df <- celltype_ref_dataset_clean
pheno_data <- new("AnnotatedDataFrame", data = pheno_data_df)
# 3. Create featureData object
feature_data_df <- data.frame(rownames_counts = rownames(gene_expression_cell_ref_transpose_df), row.names = rownames(gene_expression_cell_ref_transpose_df))
colnames(feature_data_df) <-"rownames(counts)"
feature_data <- new("AnnotatedDataFrame", data = feature_data_df)
[Screenshot: Screen Shot 2023-08-16 at 1 04 51 pm]
ExpressionSet <- new("ExpressionSet",exprs = exprs_matrix, phenoData = pheno_data)
featureData(ExpressionSet) <- feature_data
ExpressionSet

With this I get the following ExpressionSet summary. Could you please check whether it is correct?

[Screenshot: Screen Shot 2023-08-16 at 1 00 52 pm]

Also, I have a question about the reference scRNA-seq dataset. I aim to analyse multiple slices in the human heart dataset, and for the location information I first chose slice 0 for the simulation. However, I could not find a reference scRNA-seq dataset for a single slice such as slice 0, so I can only use the whole scRNA-seq dataset covering all slices (0-8) to simulate slice 0. Does that make sense?

Thanks in advance!

YingMa0107 commented 1 year ago

Hi @cristalliao,

The way you constructed the ExpressionSet object looks good to me. Basically, the assayData contains the count matrix, the phenoData contains the meta information for each cell (for example, the cell type annotation of each cell), and the featureData contains information about the genes.
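To illustrate how the three components line up, here is a small base R sketch that prepares a genes x cells matrix, per-cell metadata, and per-gene metadata. It also checks that their names agree, since an ExpressionSet requires matching sample names between assayData and phenoData. The toy tables, gene names, and cell type labels are hypothetical stand-ins for the real files in this thread.

```r
# Hypothetical toy inputs shaped like the two tables described above.
gene_expression <- data.frame(
  cellname = c("c1", "c2", "c3"),
  GeneA = c(0, 2, 5),
  GeneB = c(1, 0, 3)
)
cell_meta <- data.frame(
  cellname = c("c3", "c1", "c2"),  # deliberately in a different order
  cellType = c("Fibroblast", "Cardiomyocyte", "Cardiomyocyte"),
  sampleID = c(1L, 1L, 2L)
)

# assayData: numeric genes x cells matrix (drop the name column, transpose).
exprs_matrix <- t(as.matrix(gene_expression[, -1]))
colnames(exprs_matrix) <- gene_expression$cellname

# phenoData: one row per cell, reordered to match the matrix columns.
rownames(cell_meta) <- cell_meta$cellname
stopifnot(setequal(colnames(exprs_matrix), rownames(cell_meta)))
cell_meta <- cell_meta[colnames(exprs_matrix), ]

# featureData: one row per gene, row names equal to the matrix row names.
feature_df <- data.frame(
  gene = rownames(exprs_matrix),
  row.names = rownames(exprs_matrix)
)

# Consistency checks before building the ExpressionSet.
stopifnot(
  is.numeric(exprs_matrix),
  identical(colnames(exprs_matrix), rownames(cell_meta)),
  identical(rownames(exprs_matrix), rownames(feature_df))
)
```

Once these checks pass, `exprs_matrix`, `cell_meta`, and `feature_df` can be wrapped with `new("AnnotatedDataFrame", ...)` and `new("ExpressionSet", ...)` as in the code posted earlier in this thread.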

For the simulation, you can use the scRNA-seq reference data to generate slice 0. Ideally, if you want to benchmark reference-based deconvolution algorithms, it's better to use one scRNA-seq reference dataset to simulate the spatial transcriptomics data and another scRNA-seq dataset as the reference for deconvolution. So in the CARD simulation framework, we split the reference into two, using one split to simulate the spatial transcriptomics data and the other split to perform deconvolution.
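One way to produce such a two-way split is to divide the cells within each cell type, so that both halves cover every type. The base R sketch below is only an assumption about how this might be done; the exact splitting procedure used in the CARD paper is not described in this thread, and the toy metadata is hypothetical.

```r
set.seed(42)

# Hypothetical per-cell metadata (cell names and cell type labels).
cell_meta <- data.frame(
  cellname = paste0("c", 1:12),
  cellType = rep(c("typeA", "typeB", "typeC"), times = c(6, 4, 2))
)

# Stratified split: within each cell type, randomly assign roughly half
# of the cells to split 1 and the rest to split 2.
split_label <- ave(
  seq_len(nrow(cell_meta)), cell_meta$cellType,
  FUN = function(i) sample(rep(1:2, length.out = length(i)))
)

# Split 1 is used to simulate the spatial data, split 2 as the
# deconvolution reference, so no cell is used for both.
sim_cells <- cell_meta$cellname[split_label == 1]
ref_cells <- cell_meta$cellname[split_label == 2]
```

The resulting name vectors can then be used to subset the columns of the expression matrix (and the rows of the metadata) before building one ExpressionSet per split.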

Best, Ying