Feature extraction parameter tuning tests

alex-l-kong commented 8 months ago

Relevant background

Certain parameters throughout the TNBC pipeline affect the data generated in different ways. We wish to experiment which ones contribute the most to the final outputs.

Design overview

_1_postprocessing_cell_tableupdates.py

Each functional marker is set to a certain threshold; any cells at or above that value is marked as positive for the corresponding marker. Within a certain range window around each threshold, we wish to see how changes affect the number of cells marked positive for each marker.

For each functional marker, we test changes at each of the following values:

0.5x marker threshold
0.75x marker threshold
1x marker threshold (baseline)
1.5x marker threshold
2x marker threshold (TODO: look into testing 1.25x and 1.5x instead to be symmetrical)

The plot we'll be creating is the percent change of cells marked as positive for each marker at each multiplier. At 1x, the percent change will always be 0.

_5_create_dfs_percore.py

During generation of each of the functional, morphology, diversity, and distance features per core datasets, a min_cell parameter is used to select only FOVs within each metric/compartment/cell_type grouping that have high enough counts to be included. The min_cell baseline is 5.

Because a different number of FOVs may be subsetted for each grouping, the total number of features for functional, morphology, diversity, and distance may change. The visualization we show for min_cell gains/losses can be across the entire datasets for all four, or it can be on a compartment, metric, and/or cell-type level. Either way, each visualization should follow the standard protocol of x-axis = the different parameter values tried, y-axis = the percent change observed from the baseline.

Additionally, a corresponding timepoint dataset is also generated for both functional, morphology, diversity, and distance as a result of merging with the harmonized_metadata dataset. Some analysis may want to be conducted here.

_6_create_fovstats.py

For both the broad (cluster_broad_density) and specific (cluster_density) cell type abundance features, the minimum_density parameter is used to select which FOVs have at least one cell type above (non-inclusive) this value. For each cell type pair, generate data if a FOV for at least one of these cell types have corresponding feature values above the minimum_density threshold.*

As with 5_create_dfs_per_core.py, this will affect the number of FOVs selected during the feature generation. Additionally, this will affect the 'ratio' feature generated, as minimum_density is added to both the numerator and denominator to prevent taking a log_2 of 0.

As with 5_create_dfs_per_core.py, we can either visualize the percent change in the total number of features computed, or we can do so on an individual feature level. There are several different pairs of cells to consider (especially for cluster_broad_density), so more thought will be needed if the latter method is chosen.

Additionally, the percent change in the 'ratio' values computed should also be considered. Open to ideas on the best ways to do this.

*It is possible that minimum_density will be deprecated at some point in favor of a fixed value. In which case, discard this particular part of the tuning tests.

Code mockup

The code for these sections will be fairly straightforward and each follow a general process:

Use the process defined in either 1_postprocessing_cell_table_updates.py, 5_create_dfs_per_core.py, or 6_create_fov_stats.py to recreate the data generation stage depending on the parameter(s) in question
Use for-loops to grid search over the parameter(s) in question. Collect the corresponding metric(s) for each dataset and add to a data structure that can be easily indexed (dict, pd.DataFrame, etc.).
Visualize the percent change in the metric computed at different parameter values. A simple scatter or line plot should do. Potential faceting required if doing multiple

Required inputs

As defined in 1_postprocessing_cell_table_updates.py, 5_create_dfs_per_core.py, and 6_create_fov_stats.py.

Output files

The graphs visualizing the percent change in the metrics computed.

Timeline Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

[ ] A couple days
[ ] A week
[x] Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.

Estimated date when a fully implemented version will be ready for review:

Estimated date when the finalized project will be merged in:

camisowers commented 8 months ago

Looks good, just a few notes:

1_postprocessing_cell_table_updates.py - We might want to keep the percent change in threshold symmetrical. So
- 0.5x marker threshold
- 0.75x marker threshold
- 1x marker threshold (baseline)
- ** 1.33x marker threshold (=1/0.75) we don't want people to see a larger difference between 1 - 1.5 than 0.75 - 1 and be concerned
- 2x marker threshold (=1/0.5)
5_create_dfs_per_core.py - I think the best thing is to just examine the immediate impact of adjusting the min_cell threshold on the output files. So checking the decrease in generated features in the files:
- morph_df_per_timepoint_filtered_deduped.csv and morph_df_per_core_filtered_deduped.csv
- functional_df_per_timepoint.csv and functional_df_per_core.csv
  - not sure if we want to use _filtered_deduped here because then we'd need to re-generate the inclusion matrices (inclusion_matrix_broad.csv, inclusion_matrix_med.csv, inclusion_matrix_meta.csv) and exclude_double_positive_markers.csv as well
- diversity_df_per_timepoint_filtered_deduped.csv and diversity_df_per_core_filtered_deduped.csv
- distance_df_per_timepoint_deduped.csv and distance_df_per_core_deduped.csv

The removal of features needs to be stratified by compartments: ['cancer_core', 'cancer_border', 'stroma_core', 'stroma_border', 'all'] (ignore tag and tls, we don't consider those compartments anymore). For a first pass let's just see how many features are lost using new min_cells values, and if it's a large amount then we can check if any of those features even had a significant p-value in the later scripts.

6_create_fov_stats.py - I think we'll definitely be dropping minimum density as a threshold and instead simply keep it as an addition to the calculation to avoid log(0). So let's focus on the other two analyses for now.

alex-l-kong commented 8 months ago

To summarize:

For thresholds at 0.5x, 0.75x, 1x, 1.33x, and 2x for each individual functional marker, visualize the percent change from the baseline in percentage of cells marked as positive (for 1x, the percent change is always 0%). Plan to do this across all functional markers defined in script 1.
For the functional, morph, diversity, and distance dataframes by core and timepoint, visualize the percent change in number of features at different min_cell params across the cancer_core, cancer_border, stroma_core, stroma_border, and all compartments. Include min_cell = 5 at a 0% change as a baseline for all.
- The deduped feature DF will be used for counting num features, except for possibly functional (just the non-deduped)
Drop the minimum_density test

@ngreenwald any additional features to add, modify, or drop?

camisowers commented 8 months ago

I fear we (read: I) may have overcomplicated this lol. We can just visualize this as percent increase or decrease to the original threshold x, rather than a straight multiplier value.

0.5x : 50% decrease 0.75x : 25% decrease 1x : 0% change 1.25x : 25% increase 1.5x : 50% increase 2x : 100% increase

This way we can cover a wide range while maintaining consistent increments.

alex-l-kong commented 8 months ago

I fear we (read: I) may have overcomplicated this lol. We can just visualize this as percent increase or decrease to the original threshold x, rather than a straight multiplier value.

0.5x : 50% decrease 0.75x : 25% decrease 1x : 0% change 1.25x : 25% increase 1.5x : 50% increase 2x : 100% increase

This way we can cover a wide range while maintaining consistent increments.

Yeah agreed this makes a whole lot more sense, I was starting to wonder likewise.

ngreenwald commented 8 months ago

Cami, do you have the list of all parameters that are used in the processing pipeline? I want to see if there are any other steps that would be worth including in the first pass

camisowers commented 8 months ago

Sure.

5_create_dfs_per_core: min_cells = 5
6_create_fov_stats: minimum_density = 0.0005, minimum_abundance = 0.01
3_create_image_masks: sigma=2, intensity_thresh=350, min_mask_size=5000, max_hole_size=1000, erosion by 15 pixels
4_ecm_preprocessing: ECM masks (threshold=0.1, smooth_val=5, erode_val=5), training Kmeans (ecm_fraction< 0.1, crop size = 256)

camisowers commented 8 months ago

Giving 5_create_dfs_per_core.py a second pass though, there are a few things that we could also test if you're concerned about features that get dropped:

the inclusion matrices use mean_percent_positive = 0.05
exclude_double_positive_markers.csv checks (corr_1 > 0.7) | (corr_2 > 0.7) for two markers
distance_df_keep_features.csv uses corr_vals < 0.7

ngreenwald commented 8 months ago

Okay great. Here's my comments.

For the first pass, instead of plotting the % change in the target metric compared to default, I think we should just plot the raw value. We'll likely want to do some kind of transformation in the end for visualizations, but at first I think it'll be more useful to look at, for example, the number of cells positive for each marker at each threshold, rather than % change in positive cells.
For functional marker thresholding, I don't think we need to break it down by cell type, like I had suggested earlier. We can have a single plot for each functional marker, looking across all cells in the dataset, showing the change in positive cells. For which specific values to check, I think it does make sense to have the change in this feature value by multiplicatively symmetrical. I would look at 1/4, 1/2, 3/4, 7/8, 1, 8/7, 4/3, 2, and 4x change in value.

I've included comments below for the other features, but I think it would be good to first look just at the functional marker cell table example, generate the plots, and then once we're happy with them, move on to the other feature types

For the min_cells argument across features, let's focus just on the 'all' compartment. Here we want the output to be the number of FOVs that were excluded. I think we should do all of this at the FOV-level, no need to look at timepoints. We should calculate this separately for each feature. For a first pass, I think it's okay to plot them all together. So this would mean a boxplot or violinplot or something for each of the increments of min cells, where each dot is a specific feature. I would try min_cells = 1, 3, 5, 10, 20.
Skip the minimum_density impact on the ratio features
I would add in a section on the compartment masks, since these are used by so many other features. For now, let's do all of the ones you listed, we may end up including only a subset in the paper, but will be interesting to see.

alex-l-kong commented 7 months ago

@ngreenwald for visualization 1 (functional marker thresholds), here's a rough draft of how it might look. Aside from sizing, axis, label, gridding, normalization, etc. adjustments:

Is a bar graph preferred?
Debating over the best way to show the thresholds chosen, along with the baseline used. For now, I've decided to include the 1x baseline in the title, and have tick labels indicate the specific threshold multiplier. Is it preferred to show the raw threshold values along the x-axis instead?

functional_marker_threshold_experiments

ngreenwald commented 7 months ago

This looks good, the dots are better than a bar I think. I would keep the multiplier on the x axis like it is, rather than the raw value, but would show the log2 of the ratio.

alex-l-kong commented 7 months ago

@ngreenwald for the min_cells parameter tests:

By altogether, do you mean the number of FOVs dropped per feature across the functional, morph, diversity, and distance dataframes combined. Or one plot each for functional, morph, diversity, and distance?
For the functional DataFrame, per @camisowers inquiry, when replicating for different min_cells values should we use the deduped or non-deduped version?
For the functional DataFrame, it contains total_freq as a feature, however generating this has been commented out in 5_create_dfs_per_core.py. Should this be excluded for functional?

alex-l-kong commented 7 months ago

@camisowers for the compartment masks (3_create_image_masks.py), we should group these by the type of mask being created:

_cancermask
- min_mask_size (default 7000)
- border_size (default 50)
_tls_labelmask (I think we can probably skip these since we don't really use them anymore)
- sigma (default 4)
- min_size (default 25000)
- area_threshold (default 7000)
_goldmask
- sigma (default 2)
- intensity_thresh (default 350)
- min_mask_size (default 5000)
- max_hole_size (default 1000)

We can discuss what metric would be good to measure. Number of cells segmented would be a good, easy one to start with. Open to others.

I can get some starter code going for this, but due to this part of the tuning pipeline being added on later, I most likely will not be able to finalize this visualization before I leave. @camisowers can you finish the remaining portions?

camisowers commented 7 months ago

Sure. We don't use the tls or tagg masks any more, so no need to look into that.

We'll need to look specifically at the gold mask and cancer mask generation; for the cancer core / border and stroma core / border we'll need to check if adjusting the 50 pixel border width changes anything as well.

ngreenwald commented 7 months ago

As we discussed, I think there may be more parameters for generating the compartment masks that are hard-coded, would be good to double check.

For the first set of parameter changes, the output can be the % of image that is included in cancer mask, before defining the borders.

Then we can separately show how the number of pixels for the border changes the amount of border mask that results

camisowers commented 7 months ago

For the cancer masks:

create_cell_mask() uses sigma=10, min_mask_size=0, max_hole_size=100000, intensity_thresh=0.3 to mask the Cancer cells
- don't think intensity_thresh does anything here since the cell values are all ints
create_cancer_boundary() uses sigma=10, min_mask_size=7000, max_hole_size=1000, intensity_thresh=0.0015 to mask the ECAD signal (this is combined with the cell mask detailed above using np.logical_or() to create the cancer mask)

Compartment regions:

create_cancer_boundary() uses border_size=50 to stratify into cancer core, cancer border, stroma border, stroma core

angelolab / TNBC_python_scripts

Feature extraction parameter tuning tests #36