This code was written to handle either comparisons between FOVs or comparisons between timepoints. However, since we're only going to be looking at timepoints, it may be simpler to rewrite these functions with just timepoints in mind.
The output should be a dataframe that has, for each patient, the distance between primary and baseline, baseline and post_induction, and post_induction and on_nivo.
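A minimal sketch of that output, assuming a tidy input dataframe with one row per patient/timepoint and numeric feature columns (the column names here are hypothetical, and Euclidean distance is a placeholder for whichever metric the existing functions use):

```python
import numpy as np
import pandas as pd

TIMEPOINT_PAIRS = [("primary", "baseline"),
                   ("baseline", "post_induction"),
                   ("post_induction", "on_nivo")]

def timepoint_distances(df, feature_cols):
    """One row per patient, with the distance between each consecutive timepoint pair."""
    rows = []
    for patient, grp in df.groupby("patient"):
        by_tp = grp.set_index("timepoint")[feature_cols]
        row = {"patient": patient}
        for tp1, tp2 in TIMEPOINT_PAIRS:
            if tp1 in by_tp.index and tp2 in by_tp.index:
                row[f"{tp1}__{tp2}"] = np.linalg.norm(by_tp.loc[tp1] - by_tp.loc[tp2])
            else:
                row[f"{tp1}__{tp2}"] = np.nan  # timepoint missing for this patient
        rows.append(row)
    return pd.DataFrame(rows)
```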
[x] Vary feature extraction parameters (@alex-l-kong and @camisowers)
We have many places in the feature extraction pipeline where specific decisions were made. For example, thresholds for generating the masks, thresholds for calculating cell ratios, thresholds for determining functional marker positivity, etc.
We want to show which choices we made had a large impact on the results, and which did not.
The first step will be putting together a list of all the places where these types of decisions are made, which Cami has already started to work on.
Once this list has been constructed, the second step is to systematically change the thresholds and see how it impacts the immediate output of that step.
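As a hedged illustration of step two, a sweep over one such decision (functional marker positivity) might look like the following; the intensity table, the 0.5 default cutoff, and the grid are all stand-ins, not pipeline values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real cells x markers intensity table.
rng = np.random.default_rng(0)
marker_intensity = pd.DataFrame(rng.random((1000, 5)),
                                columns=[f"marker_{i}" for i in range(5)])

def positivity(intensity, threshold):
    """Binarize marker intensity at the given cutoff."""
    return (intensity > threshold).astype(int)

default = positivity(marker_intensity, threshold=0.5)  # assumed default cutoff
for t in np.linspace(0.3, 0.7, 9):
    alt = positivity(marker_intensity, threshold=t)
    # fraction of per-cell positivity calls that flip relative to the default
    flipped = (alt != default).to_numpy().mean()
    print(f"threshold={t:.2f}: {flipped:.1%} of calls change")
```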
[x] Cell types by timepoint (@camisowers)
Same as by tissue site, but broken out by primary/baseline/post_induction/on_nivo.
Plot styling: use `sns.despine()`, as in the sketch below.
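A minimal plotting sketch, assuming a hypothetical long-format table with timepoint/cell_type/proportion columns (synthetic data stands in for the real counts):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

order = ["primary", "baseline", "post_induction", "on_nivo"]

# Synthetic stand-in for the real per-patient cell type proportions.
rng = np.random.default_rng(0)
cell_counts = pd.DataFrame({
    "timepoint": np.repeat(order, 40),
    "cell_type": np.tile(["T cell", "B cell", "Macrophage", "Tumor"], 40),
    "proportion": rng.random(160),
})

ax = sns.barplot(data=cell_counts, x="timepoint", y="proportion",
                 hue="cell_type", order=order)
sns.despine()  # drop the top/right spines, per the styling note above
ax.set_ylabel("Mean cell type proportion")
plt.tight_layout()
```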
[x] #74
Since we are using slightly different cutoffs for significance in different parts of the codebase, we want to have an empirical estimate of our false positive rate. The easiest way to do this is to shuffle the patient labels, and rerun the comparisons to see what hits come up.
We can do this for the MIBI features associated with response, as well as for the genomic features.
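A sketch of the shuffling check, with synthetic stand-ins for the real patients x features matrix and response labels; `alpha` should mirror whichever cutoff that part of the codebase uses, and the Mann-Whitney test is just one plausible per-feature comparison:

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Synthetic stand-ins for the real feature matrix and patient labels.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(60, 50)),
                        columns=[f"feat_{i}" for i in range(50)])
response = pd.Series([0] * 30 + [1] * 30)

alpha = 0.05  # mirror the cutoff used in that part of the codebase
null_hits = []
for _ in range(100):
    shuffled = rng.permutation(response.to_numpy())
    hits = sum(
        mannwhitneyu(features.loc[shuffled == 0, col],
                     features.loc[shuffled == 1, col]).pvalue < alpha
        for col in features.columns
    )
    null_hits.append(hits)
print(f"median significant features under shuffled labels: {np.median(null_hits):.0f}")
```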
[x] Correlation of response-associated features (@jranek)
For top features, do they tend to correlate with one another?
Do they correlate more than features of the same type that aren't response-associated?
Do they correlate more than any random set of image features?
Are there interesting patterns/relationships between them?
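A sketch of one way to frame the comparison, with a synthetic feature matrix standing in for the real one and a hypothetical `top` list for the response-associated features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in patients x features matrix; `top` would be the response-associated set.
features = pd.DataFrame(rng.normal(size=(60, 100)),
                        columns=[f"feat_{i}" for i in range(100)])
top = [f"feat_{i}" for i in range(10)]  # hypothetical top features

def mean_abs_corr(df, cols):
    """Mean absolute pairwise Spearman correlation among the given columns."""
    corr = df[list(cols)].corr(method="spearman").abs().to_numpy()
    return corr[np.triu_indices_from(corr, k=1)].mean()

# Compare against size-matched random feature sets drawn from the full matrix.
random_set = rng.choice(features.columns, size=len(top), replace=False)
print("top features:   ", mean_abs_corr(features, top))
print("random features:", mean_abs_corr(features, random_set))
```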
[x] Outlier patients (@jranek)
Look at patients who have unexpected values for top associated features. For example, non-responders who have high values for features associated with response, responders who have low values for features associated with response, etc.
Do those patients have discordant values across multiple features overall?
Are they more extreme in other features to "compensate" for unexpected features?
Or is it random, with no clear pattern?
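One rough way to flag such patients, sketched on synthetic stand-ins; the direction-of-association vector and the quantile cutoffs are assumptions for illustration, not pipeline choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-ins: z-scored top features, each feature's direction of association
# (+1 = higher in responders), and the observed response labels.
features = pd.DataFrame(rng.normal(size=(60, 10)),
                        columns=[f"feat_{i}" for i in range(10)])
direction = pd.Series(1, index=features.columns)
responder = pd.Series(rng.integers(0, 2, size=60), name="responder")

# Positive score = the patient's values point toward response overall.
score = (features * direction).mean(axis=1)

# Flag non-responders with high scores and responders with low scores.
hi, lo = score.quantile([0.9, 0.1])
flagged = ((responder == 0) & (score > hi)) | ((responder == 1) & (score < lo))
print(pd.concat([score.rename("score"), responder], axis=1)[flagged])
```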
[x] Neighborhood analyses (@jranek and @camisowers)
We currently do not have any neighborhood analysis as part of the feature pipeline. I looked briefly at using the neighborhood notebook, but the results didn't look great. However, I spent only minimal time on it.
Even if the results end up not being great, I think it's worth including in the paper so we can say we did it and didn't see any strong associations with outcome.
We should run neighborhood analysis on the whole cohort, play around with the parameters to see what looks best, and then add this in as a feature; one common recipe is sketched below.
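A sketch of that recipe on synthetic data: k-nearest-neighbor composition per cell, clustered into niches. `k` and `n_niches` are exactly the kind of parameters worth sweeping; nothing here reflects the existing notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-in cell table: fov, x/y centroids, and an assigned cell cluster.
cells = pd.DataFrame({
    "fov": rng.integers(0, 4, size=2000),
    "x": rng.random(2000), "y": rng.random(2000),
    "cell_cluster": rng.choice(["T", "B", "Mac", "Tumor"], size=2000),
})

k, n_niches = 10, 8  # the parameters worth sweeping
comps = []
for _, grp in cells.groupby("fov"):
    nbrs = NearestNeighbors(n_neighbors=k).fit(grp[["x", "y"]])
    _, idx = nbrs.kneighbors(grp[["x", "y"]])
    labels = grp["cell_cluster"].to_numpy()
    # each cell's neighborhood composition (self included), as cluster fractions
    comp = pd.DataFrame([pd.Series(labels[i]).value_counts(normalize=True)
                         for i in idx], index=grp.index)
    comps.append(comp)
comps = pd.concat(comps).fillna(0)

# cluster the composition vectors into niches and map back onto the cells
niches = KMeans(n_clusters=n_niches, n_init=10, random_state=0).fit_predict(comps)
cells["niche"] = pd.Series(niches, index=comps.index)
```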
[ ] Multivariate modeling to predict outcome (@jranek)
Daisy has put together a lasso model to predict outcome per timepoint. However, there are many different approaches to this task. It would be good to confirm that we see the same patterns using a different approach, which would also let us compare the approaches against each other.
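A sketch of one alternative for that comparison; a random forest is just one reasonable choice, and the synthetic per-timepoint data stands in for the real matrices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in per-timepoint design matrices; the real X/y come from the pipeline.
timepoints = ["primary", "baseline", "post_induction", "on_nivo"]
for i, tp in enumerate(timepoints):
    X, y = make_classification(n_samples=60, n_features=40, random_state=i)
    # cross-validated AUC, directly comparable against the lasso's
    auc = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                          X, y, cv=5, scoring="roc_auc")
    print(f"{tp}: AUC = {auc.mean():.2f} ± {auc.std():.2f}")
```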
[x] Differences in significant features by timepoint (@jranek)
We know that different features came up as significant at different timepoints. However, graphically depicting that has been a challenge.
One idea is to plot some measure of significance, effect size, ranking, etc. for each feature on the y-axis, with the four timepoints on the x-axis. Using a connected lineplot, this would show the trajectory of a subset of the features over time.
Alternatively, we could make a heatmap with features in the rows, timepoints in the columns, and each box showing effect size, significance, etc.
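A sketch of both depictions on synthetic data; the `importance` column is a placeholder for whichever measure we settle on:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
order = ["primary", "baseline", "post_induction", "on_nivo"]
# Stand-in long-format table: feature / timepoint / importance.
sig = pd.DataFrame({
    "feature": np.repeat([f"feat_{i}" for i in range(6)], 4),
    "timepoint": np.tile(order, 6),
    "importance": rng.normal(size=24),
})
sig["timepoint"] = pd.Categorical(sig["timepoint"], categories=order, ordered=True)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# Connected lineplot: one trajectory per feature across the four timepoints.
sns.lineplot(data=sig.sort_values("timepoint"), x="timepoint", y="importance",
             hue="feature", marker="o", ax=ax1)

# Heatmap: features in rows, timepoints in columns, color = the chosen measure.
wide = sig.pivot(index="feature", columns="timepoint", values="importance")
sns.heatmap(wide[order], cmap="vlag", center=0, ax=ax2)
plt.tight_layout()
```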
I've included a couple of different example heatmaps I tried previously. The challenge with all of these was figuring out what measure to use to show importance/significance. We can discuss the drawbacks of using importance score or ranking as a raw value, and whether there are other changes to potentially make to this stat. Here's the code that was used to generate them.