converting proseg output into xenium ranger compatible format distorted data

diala-ar commented 2 months ago

Thanks so much for the amazing package! We tried proseg and we can see an clear improvement in cell segmentation of our samples.

After running proseg, I successfully converted proseg output into baysor compatible format and then imported those into xenium ranger as explained in proseg vignette. However, on visualizing proseg data in xenium explorer, a lot of cells appear to have many reads of a specific gene but the cell feature matrix shows 0 count for this gene in those cells, to note that those transcripts have a qv > 20. Any idea why I am getting this discrepancy?

I used xenium ranger v2 (xeniumranger xenium-2.0.0.12) and proseg v1.0.3.

Thanks.

dcjones commented 2 months ago

This is odd, and I do see the same thing in xenium explorer with the counts not necessarily corresponding to what's shown.

I'll investigate this some more, but also should note that I don't use and don't necessarily trust the counts that are generated by converting to a xenium bundle and mainly use it just for visualization purposes.

diala-ar commented 1 month ago

Thanks @dcjones so much for your prompt response! I am assuming we can read the proseg output in an R or python package other than Seurat. Could you please give me some directions, which package to use to read proseg output and perform spatial downstream analyses (I am pretty new to spatial data analyses)? Do you think the problem is at the level of baysor conversion or xenium ranger import? Thanks for your precious help!

dcjones commented 1 month ago

Yes, proseg itself will output a number of tables, most importantly:

expected-counts.csv.gz: Give the cell-by-gene count matrix
cell-metadata.csv.gz: Spatial cell positions and other information.

I don't have tutorials for analyzing the output on a specific platform, but you can read them into R or python easily anywhere using the standard tools for reading csv files (e.g. read.csv in R, pandas.read_csv in python). I'll probably put up some more detailed tutorials at some point.

diala-ar commented 1 month ago

Thanks again @dcjones for your help! I was able to read in proseg data into an R Seurat object, I will share the code at the end of this message. Below are some comments and questions concerning proseg results:

I noticed some differences between proseg and Xenium bundle output generated from proseg data. Boundaries of cells taken from proseg output are very similar to those taken from Xenium output and sometimes identical. In other cases, as in the examples below, proseg cells boundaries show some extra regions (encircled in red) when compared to Xenium cell boundaries. What is the source of this difference? And should we use proseg or Xenium bundle output?

What is the difference between x, y, z columns and observed_x, observed_y, observed_z columns in the transcript-metadata.csv.gz file proseg output and which one on these columns should we use?
What does confusion mean in the transcript-metadata.csv.gz file? Is it related to the diffusion part? How should we filter transcripts in this file so the number of reads in a given cell matches the number of reads in the expected_counts.csv.gs file? For instance, I tried the following filtering criteria and the number still does not match: background==0 & confusion==0 & probability>0.6 & gene does not contain 'Unassigned' or 'Control'. When I used the criteria above, I ended up with 886 reads, while this cell is expected to have 1014.662 reads! Does this criteria make sense?

Below is the code I used to read in proseg output into Seurat:

LoadXenium2` = function (proseg_dir, fov = "fov", assay = "Xenium") {
  data <- ReadXenium2(proseg_dir = proseg_dir)
  segmentations.data <- list(centroids = CreateCentroids(data$centroids), 
                             segmentation = CreateSegmentation(data$segmentations))
  coords <- CreateFOV(coords = segmentations.data, type = c("segmentation", "centroids"), 
                      molecules = data$microns, assay = assay)
  xenium.obj <- CreateSeuratObject(counts = data$matrix[['GeneExpression']], 
                                   assay = assay)
  xenium.obj[["BlankCodeword"]] <- CreateAssayObject(counts = data$matrix[["UnassignedCodeword"]])
  xenium.obj[["ControlCodeword"]] <- CreateAssayObject(counts = data$matrix[["ControlCodeword"]])
  xenium.obj[["ControlProbe"]] <- CreateAssayObject(counts = data$matrix[["ControlProbe"]])
  xenium.obj[[fov]] <- coords
  return(xenium.obj)
}

ReadXenium2 = function(proseg_dir) {

  expected_counts = read.csv(file.path(proseg_dir, 'expected-counts.csv.gz'))
  expected_counts = t(expected_counts)
  colnames(expected_counts) = paste0('cell-', 0:(ncol(expected_counts)-1))
  matrix = list(GeneExpression = expected_counts[-grep('Blank|Unassigned|Control', row.names(expected_counts)), ],
                ControlCodeword = expected_counts[grep('ControlCodeword', row.names(expected_counts)), ],
                ControlProbe = expected_counts[grep('ControlProbe', row.names(expected_counts)), ],
                UnassignedCodeword = expected_counts[grep('UnassignedCodeword', row.names(expected_counts)), ])

  cell_meta = read.csv(file.path(proseg_dir, "cell-metadata.csv.gz"))
  centroids = data.frame(x = cell_meta$centroid_x, 
                         y = cell_meta$centroid_y, 
                         cell = paste0('cell-', cell_meta$cell))

  cell_boundaries_sf = read_sf(file.path(proseg_dir, 'cell-polygons.geojson'))
  st_crs(cell_boundaries_sf) = NA
  cell_boundaries_df = sfheaders::sf_to_df(cell_boundaries_sf)
  cell_boundaries_df = data.frame(cell = paste0('cell-', cell_boundaries_df$sfg_id-1), 
                                  x = cell_boundaries_df$x,
                                  y = cell_boundaries_df$y)

  tx_dt <- as.data.frame(data.table::fread(file.path(proseg_dir, "transcript-metadata.csv.gz")))
  df <- tx_dt[tx_dt$background==0 & tx_dt$confusion==0 & tx_dt$assignment!=4294967295 & tx_dt$probability>0.6, ]
  df <- data.frame(x = df$observed_x, 
                   y = df$observed_y, 
                   gene = df$gene)
  df = df[-grep('Unassigned|Control', df$gene), ]

  list(matrix=matrix, centroids=centroids, segmentations=cell_boundaries_df, microns=df)
}

Thanks.

dcjones commented 1 month ago

I think this is due to xenium explorer not supporting cells being multi-polygons, which is the case when you have these voxels that are only connected along the diagonal.
By default, proseg will do transcript repositioning. This is described in the paper, but the idea is to try to slightly adjust positions of some transcripts. So the "observed" columns will correspond to the input data, and the x/y/z columns will correspond to where the transcript was repositioned by proseg.
"confusion" is poorly named, but it's part of the noise model in proseg. So either background or confusion being non-zero means the transcript is thought to be noise. There isn't a way to filter this table to exactly match the expected counts, but you can probably get close by weighting each by the probability column. I find that expected counts work well, but there could certainly be a rationale for filtering the transcript table and making your own point estimate. For example, proseg does not know which genes/features are negative controls, so does not automatically filter these out when counting.

dcjones / proseg

converting proseg output into xenium ranger compatible format distorted data #11