Looks good, just a few notes:
`1_postprocessing_cell_table_updates.py` - We might want to keep the percent change in threshold symmetrical, so that each percent decrease we test has a matching percent increase.
`5_create_dfs_per_core.py` - I think the best thing is to just examine the immediate impact of adjusting the `min_cell` threshold on the output files. So checking the decrease in generated features in the files:

- `morph_df_per_timepoint_filtered_deduped.csv` and `morph_df_per_core_filtered_deduped.csv`
- `functional_df_per_timepoint.csv` and `functional_df_per_core.csv` (not the `_filtered_deduped` versions here, because then we'd need to re-generate the inclusion matrices (`inclusion_matrix_broad.csv`, `inclusion_matrix_med.csv`, `inclusion_matrix_meta.csv`) and `exclude_double_positive_markers.csv` as well)
- `diversity_df_per_timepoint_filtered_deduped.csv` and `diversity_df_per_core_filtered_deduped.csv`
- `distance_df_per_timepoint_deduped.csv` and `distance_df_per_core_deduped.csv`

The removal of features needs to be stratified by compartments: `['cancer_core', 'cancer_border', 'stroma_core', 'stroma_border', 'all']` (ignore `tagg` and `tls`, we don't consider those compartments anymore). For a first pass, let's just see how many features are lost using new `min_cells` values, and if it's a large amount then we can check whether any of those features even had a significant p-value in the later scripts.
To summarize:

- `min_cell` params across the `cancer_core`, `cancer_border`, `stroma_core`, `stroma_border`, and `all` compartments. Include `min_cell = 5` at a 0% change as a baseline for all.
- the `minimum_density` test

@ngreenwald any additional features to add, modify, or drop?
I fear we (read: I) may have overcomplicated this lol. We can just visualize this as percent increase or decrease relative to the original threshold x, rather than a straight multiplier value:

- 0.5x: 50% decrease
- 0.75x: 25% decrease
- 1x: 0% change
- 1.25x: 25% increase
- 1.5x: 50% increase
- 2x: 100% increase

This way we can cover a wide range while maintaining consistent increments.
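For concreteness, a minimal sketch of that multiplier-to-percent-change mapping (plain Python; the variable names are just illustrative):

```python
# Map threshold multipliers to their percent-change-from-baseline labels.
multipliers = [0.5, 0.75, 1.0, 1.25, 1.5, 2.0]

for m in multipliers:
    pct = (m - 1.0) * 100  # 0.5x -> -50%, 2x -> +100%
    print(f"{m}x : {pct:+.0f}% change")
```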
Yeah agreed this makes a whole lot more sense, I was starting to wonder likewise.
Cami, do you have the list of all parameters that are used in the processing pipeline? I want to see if there are any other steps that would be worth including in the first pass.
Sure.

- `min_cells = 5`
- `minimum_density = 0.0005`, `minimum_abundance = 0.01`
- `sigma=2`, `intensity_thresh=350`, `min_mask_size=5000`, `max_hole_size=1000`, erosion by 15 pixels
- `threshold=0.1`, `smooth_val=5`, `erode_val=5`
- training Kmeans (`ecm_fraction < 0.1`, `crop size = 256`)

Giving `5_create_dfs_per_core.py` a second pass though, there are a few things that we could also test if you're concerned about features that get dropped:

- `mean_percent_positive = 0.05`
- `(corr_1 > 0.7) | (corr_2 > 0.7)` for two markers
- `corr_vals < 0.7`
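As a rough illustration of that correlation filter, a pandas sketch over a hypothetical marker-pair table (the schema and values here are made up, not the pipeline's actual ones):

```python
import pandas as pd

# Hypothetical marker-pair correlation table; schema and values are illustrative.
marker_corrs = pd.DataFrame({
    "marker_pair": ["pair_a", "pair_b", "pair_c"],
    "corr_1": [0.82, 0.31, 0.65],
    "corr_2": [0.45, 0.12, 0.90],
})

# Flag pairs where either correlation exceeds 0.7 (candidates for exclusion).
flagged = marker_corrs[(marker_corrs["corr_1"] > 0.7) | (marker_corrs["corr_2"] > 0.7)]
print(flagged["marker_pair"].tolist())  # ['pair_a', 'pair_c']
```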
Okay great. Here are my comments.
I've included comments below for the other features, but I think it would be good to first look just at the functional marker cell table example, generate the plots, and then once we're happy with them, move on to the other feature types.
@ngreenwald for visualization 1 (functional marker thresholds), here's a rough draft of how it might look. Aside from sizing, axis, label, gridding, normalization, etc. adjustments:
This looks good, the dots are better than a bar I think. I would keep the multiplier on the x axis like it is, rather than the raw value, but would show the log2 of the ratio.
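A rough matplotlib sketch of that suggestion, plotting the log2 of the threshold multiplier on the x-axis (the percent-change values here are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

multipliers = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 2.0])
# Made-up percent changes in positive cells for a single marker.
pct_change = np.array([40.0, 15.0, 0.0, -12.0, -20.0, -35.0])

plt.scatter(np.log2(multipliers), pct_change)
plt.axvline(0, linestyle="--", color="gray")  # 1x baseline
plt.xlabel("log2(threshold multiplier)")
plt.ylabel("% change in positive cells")
plt.show()
```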
@ngreenwald for the `min_cells` parameter tests:

- For the `min_cells` values, should we use the deduped or non-deduped version?
- `total_freq` is listed as a feature; however, generating this has been commented out in `5_create_dfs_per_core.py`. Should this be excluded for functional?

@camisowers for the compartment masks (`3_create_image_masks.py`), we should group these by the type of mask being created:

- `min_mask_size` (default 7000)
- `border_size` (default 50)
- `sigma` (default 4)
- `min_size` (default 25000)
- `area_threshold` (default 7000)
- `sigma` (default 2)
- `intensity_thresh` (default 350)
- `min_mask_size` (default 5000)
- `max_hole_size` (default 1000)

We can discuss what metric would be good to measure. The number of cells segmented would be a good, easy one to start with. Open to others.
I can get some starter code going for this, but since this part of the tuning pipeline was added on later, I most likely won't be able to finalize this visualization before I leave. @camisowers can you finish the remaining portions?
Sure. We don't use the tls or tagg masks any more, so no need to look into that.
We'll need to look specifically at the gold mask and cancer mask generation; for the cancer core / border and stroma core / border we'll need to check if adjusting the 50 pixel border width changes anything as well.
As we discussed, I think there may be more parameters for generating the compartment masks that are hard-coded, would be good to double check.
For the first set of parameter changes, the output can be the % of image that is included in cancer mask, before defining the borders.
Then we can separately show how the number of pixels for the border changes the amount of border mask that results.
For the cancer masks:

- `sigma=10`, `min_mask_size=0`, `max_hole_size=100000`, `intensity_thresh=0.3` to mask the Cancer cells
  - Not sure `intensity_thresh` does anything here since the cell values are all ints
- `sigma=10`, `min_mask_size=7000`, `max_hole_size=1000`, `intensity_thresh=0.0015` to mask the ECAD signal (this is combined with the cell mask detailed above using `np.logical_or()` to create the cancer mask)

Compartment regions:

- `border_size=50` to stratify into cancer core, cancer border, stroma border, stroma core
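Putting those parameters together, a minimal numpy/scipy sketch of this flow. The function names are mine, and the actual code in `3_create_image_masks.py` may differ (e.g. in how `min_mask_size`/`max_hole_size` cleanup is done); this is just to pin down the structure being discussed:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_erosion

def make_cancer_mask(cancer_cell_img, ecad_img):
    """Smooth each channel, threshold, and combine with np.logical_or()
    (small-object and hole cleanup omitted for brevity)."""
    cell_mask = gaussian_filter(cancer_cell_img.astype(float), sigma=10) > 0.3
    ecad_mask = gaussian_filter(ecad_img.astype(float), sigma=10) > 0.0015
    return np.logical_or(cell_mask, ecad_mask)

def split_core_border(mask, border_size=50):
    """Stratify a compartment mask into core and border by eroding border_size pixels."""
    core = binary_erosion(mask, iterations=border_size)
    border = mask & ~core
    return core, border

# First-pass metric suggested above: % of the image included in the cancer mask,
# before defining the borders:
# coverage_pct = 100 * make_cancer_mask(cancer_cell_img, ecad_img).mean()
```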
Relevant background
Certain parameters throughout the TNBC pipeline affect the data generated in different ways. We wish to determine experimentally which ones contribute the most to the final outputs.
Design overview
1_postprocessing_cell_table_updates.py
Each functional marker is set to a certain threshold; any cells at or above that value are marked as positive for the corresponding marker. Within a certain window around each threshold, we wish to see how changes affect the number of cells marked positive for each marker.
For each functional marker, we test changes at each of the following multipliers of the original threshold: 0.5x, 0.75x, 1x, 1.25x, 1.5x, and 2x.
The plot we'll be creating shows the percent change in the number of cells marked as positive for each marker at each multiplier. At 1x, the percent change will always be 0.
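A minimal sketch of this computation. The cell table schema (one raw-intensity column per marker) and the function name are assumptions for illustration, not the pipeline's actual interface:

```python
import pandas as pd

MULTIPLIERS = (0.5, 0.75, 1.0, 1.25, 1.5, 2.0)

def positive_pct_change(cell_table, thresholds, multipliers=MULTIPLIERS):
    """For each marker and multiplier, the percent change (vs. 1x) in the number
    of cells at or above the scaled threshold.
    cell_table: one row per cell, one raw-intensity column per marker (assumed schema).
    thresholds: dict of marker name -> baseline threshold."""
    records = []
    for marker, base in thresholds.items():
        n_baseline = (cell_table[marker] >= base).sum()
        for m in multipliers:
            n_pos = (cell_table[marker] >= base * m).sum()
            records.append({"marker": marker, "multiplier": m,
                            "pct_change": 100 * (n_pos - n_baseline) / max(n_baseline, 1)})
    return pd.DataFrame(records)
```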
5_create_dfs_per_core.py
During generation of each of the functional, morphology, diversity, and distance per-core feature datasets, a `min_cell` parameter is used to select only FOVs within each metric/compartment/cell_type grouping that have high enough counts to be included. The `min_cell` baseline is 5.

Because a different number of FOVs may be subsetted for each grouping, the total number of features for functional, morphology, diversity, and distance may change. The visualization we show for `min_cell` gains/losses can be across the entire datasets for all four, or it can be at a compartment, metric, and/or cell-type level. Either way, each visualization should follow the standard protocol of x-axis = the different parameter values tried, y-axis = the percent change observed from the baseline.

Additionally, a corresponding timepoint dataset is also generated for each of functional, morphology, diversity, and distance as a result of merging with the `harmonized_metadata` dataset. We may want to conduct some analysis here as well.
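A rough sketch of counting retained features per compartment at different `min_cell` values. The long-format schema (one row per FOV x grouping with a `cell_count` column) and the sweep values are assumptions; the baseline of 5 comes from the discussion above:

```python
import pandas as pd

def features_kept(df, min_cells):
    """Number of metric/compartment/cell_type groupings retaining at least one FOV.
    df: long-format table, one row per FOV x grouping, with a cell_count column
    (assumed schema)."""
    kept = df[df["cell_count"] >= min_cells]
    return (kept.drop_duplicates(["compartment", "metric", "cell_type"])
                .groupby("compartment").size())

def pct_change_vs_baseline(df, values=(3, 4, 5, 6, 8, 10), baseline=5):
    """Percent change in retained features per compartment, relative to min_cell = 5.
    The sweep values are illustrative; include the baseline for a 0% reference."""
    base = features_kept(df, baseline)
    return pd.DataFrame({
        v: 100 * (features_kept(df, v).reindex(base.index, fill_value=0) - base) / base
        for v in values
    })
```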
6_create_fov_stats.py

For both the broad (`cluster_broad_density`) and specific (`cluster_density`) cell type abundance features, the `minimum_density` parameter is used to select which FOVs have at least one cell type above (non-inclusive) this value. For each cell type pair, data is generated for a FOV only if at least one of the two cell types has feature values above the `minimum_density` threshold.*

As with `5_create_dfs_per_core.py`, this will affect the number of FOVs selected during feature generation. Additionally, this will affect the `'ratio'` feature generated, as `minimum_density` is added to both the numerator and denominator to prevent taking a `log_2` of 0.

As with `5_create_dfs_per_core.py`, we can either visualize the percent change in the total number of features computed, or we can do so on an individual feature level. There are several different pairs of cells to consider (especially for `cluster_broad_density`), so more thought will be needed if the latter method is chosen. Additionally, the percent change in the `'ratio'` values computed should also be considered. Open to ideas on the best ways to do this.

*It is possible that `minimum_density` will be deprecated at some point in favor of a fixed value, in which case discard this particular part of the tuning tests.
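To make the `'ratio'` dependence concrete, a small sketch of the computation as described above (the function name is mine; the default of 0.0005 is the baseline quoted earlier in the thread):

```python
import numpy as np

def density_ratio(density_1, density_2, minimum_density=0.0005):
    """log2 ratio of two cell type densities; minimum_density is added to both the
    numerator and denominator to avoid taking log2 of 0."""
    return np.log2((density_1 + minimum_density) / (density_2 + minimum_density))

# Sweeping minimum_density shifts the ratio most when one density is near zero:
for md in (0.00025, 0.0005, 0.001):
    print(md, density_ratio(0.0, 0.01, minimum_density=md))
```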
Code mockup

The code for these sections will be fairly straightforward and will each follow a general process:

1. Re-run the relevant portions of `1_postprocessing_cell_table_updates.py`, `5_create_dfs_per_core.py`, or `6_create_fov_stats.py` to recreate the data generation stage, depending on the parameter(s) in question.
2. Aggregate the resulting metrics in an appropriate data structure (`dict`, `pd.DataFrame`, etc.).
3. Plot the percent change from the baseline for each parameter value tried.
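As a sketch of that general process, a generic harness (illustrative only, not existing pipeline code):

```python
import pandas as pd

def parameter_sweep(recreate_stage, metric_fn, param_values, baseline_value):
    """Generic harness for the process above (names are illustrative):
    recreate_stage: callable(param_value) -> regenerated data (dict, pd.DataFrame, ...)
    metric_fn: callable(data) -> scalar summary (e.g. number of features kept)."""
    baseline = metric_fn(recreate_stage(baseline_value))
    rows = [{"param_value": v,
             "pct_change": 100 * (metric_fn(recreate_stage(v)) - baseline) / baseline}
            for v in param_values]
    return pd.DataFrame(rows)
```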
Required inputs

As defined in `1_postprocessing_cell_table_updates.py`, `5_create_dfs_per_core.py`, and `6_create_fov_stats.py`.

Output files
The graphs visualizing the percent change in the metrics computed.
Timeline

Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.
Estimated date when a fully implemented version will be ready for review:
Estimated date when the finalized project will be merged in: