Nanostring-Biostats / FastReseg

FastReseg for detection and correction of cell segmentation error based on spatial profile of transcripts.
https://nanostring-biostats.github.io/FastReseg/
Other
2 stars 0 forks source link

wrapper for whole pipeline with per FOV file as inputs #3

Closed lidanwu closed 2 years ago

lidanwu commented 2 years ago

Finally, here is the wrapper that could do automatic baseline estimation, take the file path to multiple per FOV transcript files as input and save per FOV cell segmentation outcomes into disk.

Please find the examples for this fastReseg_internalRef function below:

# get example based on example dataset
data("transcriptDF")
data("ori_RawExprs")
data("refProfiles")
data("baselineCT")
extracellular_cellID <- transcriptDF[which(transcriptDF$CellId ==0), 'cell_ID'] # cell_ID for extracellualr transcripts

# case #1: provide `transcript_df` directly, 
# do auto-calculation of distance cutoff from data while using the provided cutoffs for score and transcript numbers. 
res1 <- fastReseg_internalRef(counts = ori_RawExprs, 
                              clust = NULL, 
                              refProfiles = refProfiles,
                              pixel_size = 1, 
                              zstep_size = 1, 
                              transcript_df = transcriptDF, 
                              transID_coln = "transcript_id", 
                              transGene_coln = "target", 
                              cellID_coln = "cell_ID",
                              spatLocs_colns = c("x","y","z"), 
                              extracellular_cellID = extracellular_cellID, 
                              molecular_distance_cutoff = NULL,
                              cellular_distance_cutoff = NULL,
                              score_baseline = baselineCT[['score_baseline']], 
                              lowerCutoff_transNum = baselineCT[['lowerCutoff_transNum']], 
                              higherCutoff_transNum= baselineCT[['higherCutoff_transNum']],
                              imputeFlag_missingCTs = TRUE,
                              path_to_output = "res1_directDF")

# case #2: provide file paths to per FOV transcript data files and specify the spatial offset for each FOV, 
# do auto-calculation of score and transcript number cutoffs from gene expression matrix, `counts`, and cluster assignment of each cell, `clust`, 
# do auto-calculation of distance cutoff from the 1st per FOV transcript data. 
data("counts")
data("clust")
# the example individual transcript files are stored under `data` directory of this package
# update your path accordingly
# Notice that some assays like SMI has XY axes swapped between stage and each FOV;
# coordinates for each FOV should have units in micron
fileInfo_DF <- data.frame(file_path = c("data/Run4104_FOV001__complete_code_cell_target_call_coord.csv", 
                                        "data/Run4104_FOV002__complete_code_cell_target_call_coord.csv"), 
                          slide = c(1, 1), 
                          fov = c(1,2), 
                          stage_X = 1000*c(5.13, -2.701), 
                          stage_Y = 1000*c(-0.452, 0.081))

res2 <- fastReseg_internalRef(counts = counts, 
                              clust = clust, 
                              refProfiles = NULL,
                              transDF_fileInfo =fileInfo_DF, 
                              filepath_coln = 'file_path', 
                              prefix_colns = c('slide','fov'), 
                              fovOffset_colns = c('stage_Y','stage_X'), # match XY axes between stage and each FOV
                              pixel_size = 0.18, # 0.18 micron per pixel in transcript data
                              zstep_size = 0.8, # 0.8 micron per z step in transcript data
                              transcript_df = NULL, 
                              transID_coln = NULL, # row index as transcript_id
                              transGene_coln = "target", 
                              cellID_coln = "CellId",
                              spatLocs_colns = c("x","y","z"), 
                              extracellular_cellID = c(0), # CellId = 0 means extracelluar transcripts in raw data 
                              molecular_distance_cutoff = NULL,
                              cellular_distance_cutoff = NULL,
                              score_baseline = NULL, 
                              lowerCutoff_transNum = NULL, 
                              higherCutoff_transNum= NULL,
                              imputeFlag_missingCTs = TRUE,
                              path_to_output = "res2_multiFiles")

# case #3: provide file paths to per FOV transcript data files and specify the spatial offset for each FOV, 
# do auto-calculation of score and transcript number cutoffs from gene expression matrix, `counts`, and cluster-specific reference profiles, `refProfiles`, 
# use the provided distance cutoff for `molecular_distance_cutoff` but calculate the `cellular_distance_cutoff`
res3 <- fastReseg_internalRef(counts = counts, 
                              clust = NULL, 
                              refProfiles = refProfiles,
                              transDF_fileInfo =fileInfo_DF, 
                              filepath_coln = 'file_path', 
                              prefix_colns = c('slide','fov'), 
                              fovOffset_colns = c('stage_Y','stage_X'), # match XY axes between stage and each FOV
                              pixel_size = 0.18, # 0.18 micron per pixel in transcript data
                              zstep_size = 0.8, # 0.8 micron per z step in transcript data
                              transcript_df = NULL, 
                              transID_coln = NULL, # row index as transcript_id
                              transGene_coln = "target", 
                              cellID_coln = "CellId",
                              spatLocs_colns = c("x","y","z"), 
                              extracellular_cellID = c(0), # CellId = 0 means extracelluar transcripts in raw data 
                              molecular_distance_cutoff = 2.7,
                              cellular_distance_cutoff = NULL,
                              score_baseline = NULL, 
                              lowerCutoff_transNum = NULL, 
                              higherCutoff_transNum= NULL,
                              imputeFlag_missingCTs = TRUE,
                              path_to_output = "res3_multiFiles")
lidanwu commented 2 years ago

@patrickjdanaher I have created a function to preprocess SMI data and the corresponding vigentte to perform resegmentation process on SMI raw data from end-to-end manner. To use this vignette, user would only need to change the file path to pre-built SMI giotto object and provide the corresponding config_loading.

I have applied this to the TAP data SMI003-MGH you mentioned and the post-resegementation data is stored at \\freya\NAS\lwu\testRun\SpatialTest\giotto_test\SMITAP_proj\SMI003_DavidTing_MGH\fastReseg01

Here I plotted the processing speed for each FOV as a function of the file size of final updated transcript data.frame when running on Dante container. (There's ~15% file size reduction after resegmentation due to the removal of unnecessary columns and extracellular transcripts). As expected, bigger transcript data.frame would take longer time to process (~8.33 MB per min) FastReseg_process_speed

The next thing is to enable visualization of resegmentation impacts. Please let me know if you have any comments and suggestions.

BTW, David Ross did report an issue of running this package when python is not under "/usr/bin/python3". This is because I use Giotto function for leiden clustering and Delaunay network generation. I'm not sure what the best way to address this issue. Please let me know if you have any suggestion.

lidanwu commented 2 years ago

@patrickjdanaher this PR is ready for review again. Thanks.