Open diala-ar opened 6 months ago
This is odd, and I do see the same thing in xenium explorer with the counts not necessarily corresponding to what's shown.
I'll investigate this some more, but also should note that I don't use and don't necessarily trust the counts that are generated by converting to a xenium bundle and mainly use it just for visualization purposes.
Thanks @dcjones so much for your prompt response! I am assuming we can read the proseg output in an R or python package other than Seurat. Could you please give me some directions, which package to use to read proseg output and perform spatial downstream analyses (I am pretty new to spatial data analyses)? Do you think the problem is at the level of baysor conversion or xenium ranger import? Thanks for your precious help!
Yes, proseg itself will output a number of tables, most importantly:
expected-counts.csv.gz
: Give the cell-by-gene count matrixcell-metadata.csv.gz
: Spatial cell positions and other information.I don't have tutorials for analyzing the output on a specific platform, but you can read them into R or python easily anywhere using the standard tools for reading csv files (e.g. read.csv
in R, pandas.read_csv
in python). I'll probably put up some more detailed tutorials at some point.
Thanks again @dcjones for your help! I was able to read in proseg data into an R Seurat object, I will share the code at the end of this message. Below are some comments and questions concerning proseg results:
What is the difference between x, y, z columns and observed_x, observed_y, observed_z columns in the transcript-metadata.csv.gz file proseg output and which one on these columns should we use?
What does confusion mean in the transcript-metadata.csv.gz file? Is it related to the diffusion part? How should we filter transcripts in this file so the number of reads in a given cell matches the number of reads in the expected_counts.csv.gs file? For instance, I tried the following filtering criteria and the number still does not match: background==0 & confusion==0 & probability>0.6 & gene does not contain 'Unassigned' or 'Control'. When I used the criteria above, I ended up with 886 reads, while this cell is expected to have 1014.662 reads! Does this criteria make sense?
Below is the code I used to read in proseg output into Seurat:
LoadXenium2` = function (proseg_dir, fov = "fov", assay = "Xenium") {
data <- ReadXenium2(proseg_dir = proseg_dir)
segmentations.data <- list(centroids = CreateCentroids(data$centroids),
segmentation = CreateSegmentation(data$segmentations))
coords <- CreateFOV(coords = segmentations.data, type = c("segmentation", "centroids"),
molecules = data$microns, assay = assay)
xenium.obj <- CreateSeuratObject(counts = data$matrix[['GeneExpression']],
assay = assay)
xenium.obj[["BlankCodeword"]] <- CreateAssayObject(counts = data$matrix[["UnassignedCodeword"]])
xenium.obj[["ControlCodeword"]] <- CreateAssayObject(counts = data$matrix[["ControlCodeword"]])
xenium.obj[["ControlProbe"]] <- CreateAssayObject(counts = data$matrix[["ControlProbe"]])
xenium.obj[[fov]] <- coords
return(xenium.obj)
}
ReadXenium2 = function(proseg_dir) {
expected_counts = read.csv(file.path(proseg_dir, 'expected-counts.csv.gz'))
expected_counts = t(expected_counts)
colnames(expected_counts) = paste0('cell-', 0:(ncol(expected_counts)-1))
matrix = list(GeneExpression = expected_counts[-grep('Blank|Unassigned|Control', row.names(expected_counts)), ],
ControlCodeword = expected_counts[grep('ControlCodeword', row.names(expected_counts)), ],
ControlProbe = expected_counts[grep('ControlProbe', row.names(expected_counts)), ],
UnassignedCodeword = expected_counts[grep('UnassignedCodeword', row.names(expected_counts)), ])
cell_meta = read.csv(file.path(proseg_dir, "cell-metadata.csv.gz"))
centroids = data.frame(x = cell_meta$centroid_x,
y = cell_meta$centroid_y,
cell = paste0('cell-', cell_meta$cell))
cell_boundaries_sf = read_sf(file.path(proseg_dir, 'cell-polygons.geojson'))
st_crs(cell_boundaries_sf) = NA
cell_boundaries_df = sfheaders::sf_to_df(cell_boundaries_sf)
cell_boundaries_df = data.frame(cell = paste0('cell-', cell_boundaries_df$sfg_id-1),
x = cell_boundaries_df$x,
y = cell_boundaries_df$y)
tx_dt <- as.data.frame(data.table::fread(file.path(proseg_dir, "transcript-metadata.csv.gz")))
df <- tx_dt[tx_dt$background==0 & tx_dt$confusion==0 & tx_dt$assignment!=4294967295 & tx_dt$probability>0.6, ]
df <- data.frame(x = df$observed_x,
y = df$observed_y,
gene = df$gene)
df = df[-grep('Unassigned|Control', df$gene), ]
list(matrix=matrix, centroids=centroids, segmentations=cell_boundaries_df, microns=df)
}
Thanks.
background
or confusion
being non-zero means the transcript is thought to be noise. There isn't a way to filter this table to exactly match the expected counts, but you can probably get close by weighting each by the probability column. I find that expected counts work well, but there could certainly be a rationale for filtering the transcript table and making your own point estimate. For example, proseg does not know which genes/features are negative controls, so does not automatically filter these out when counting.
Thanks so much for the amazing package! We tried proseg and we can see an clear improvement in cell segmentation of our samples.
After running proseg, I successfully converted proseg output into baysor compatible format and then imported those into xenium ranger as explained in proseg vignette. However, on visualizing proseg data in xenium explorer, a lot of cells appear to have many reads of a specific gene but the cell feature matrix shows 0 count for this gene in those cells, to note that those transcripts have a qv > 20. Any idea why I am getting this discrepancy?
I used xenium ranger v2 (xeniumranger xenium-2.0.0.12) and proseg v1.0.3.
Thanks.