angelolab / Nimbus


Experiment with pre-processing steps #53

Closed ngreenwald closed 10 months ago

ngreenwald commented 1 year ago

Once we have a trained, multi-dataset model that performs reasonably well, I think it will be worth exploring how different preprocessing affects inference. For example, if we just max-normalize the images, how is performance? If you intentionally scale different markers to have different mean/variance, how does the model perform?

Basically, before we give users the notebook to try out on their own data, we should do some adversarial testing to see how important the 99.9th percentile normalization step is.
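A minimal sketch of the kind of adversarial scaling described above, assuming an (H, W, C) multi-channel image with one channel per marker; the function name and the range of random scales are illustrative assumptions, not part of Nimbus:

```python
# Perturb each marker channel with a different, arbitrary mean/variance
# before inference, to test how sensitive the model is to preprocessing.
import numpy as np

def perturb_marker_scales(image, seed=0):
    """image: (H, W, C) float array, one channel per marker."""
    rng = np.random.default_rng(seed)
    perturbed = image.astype(np.float32).copy()
    for c in range(image.shape[-1]):
        scale = rng.uniform(0.1, 10.0)   # arbitrary per-marker rescaling
        shift = rng.uniform(0.0, 1.0)    # arbitrary per-marker offset
        perturbed[..., c] = perturbed[..., c] * scale + shift
    return perturbed
```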

JLrumberger commented 1 year ago
  1. Look at channels without positive signal, apply the worst normalization technique (min-max), and see whether the model is robust against this.
  2. Try different normalization setups and compare scores (a rough sketch of the setups follows below this list):
    • worst case: single-image 99.9th percentile
    • second worst: TONIC test set 99.9th percentile
    • normalization dict computed from the whole cohort
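For reference, a rough sketch of the normalization variants discussed in items 1 and 2; the function names are illustrative and not the actual Nimbus implementation:

```python
import numpy as np

def normalize_by_scale(img, scale):
    """Scale an image by a precomputed normalization value and clip to [0, 1]."""
    return np.clip(img / (scale + 1e-8), 0, 1)

def per_image_scale(img, q=99.9):
    """Worst case: 99.9th percentile computed from a single image."""
    return np.percentile(img, q)

def cohort_scale(images, q=99.9):
    """Best case: 99.9th percentile computed over the whole cohort."""
    return np.percentile(np.concatenate([im.ravel() for im in images]), q)

def min_max(img):
    """Baseline min-max normalization, fragile for channels without positive signal."""
    rng = img.max() - img.min()
    return (img - img.min()) / (rng + 1e-8)
```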
JLrumberger commented 1 year ago

**Quantitative** On top you find the F1 scores for each of the three setups; below are F1 scores split by marker for the best setup (full-dataset 99.9th percentile normalization) vs. the worst setup (single-image 99.9th percentile normalization). For Calprotectin, Chymase Tryptase and Fibronectin we observe the biggest performance gap. On visual inspection these markers look sparse, noisy and not very specific, so it makes sense that per-image normalization has a bigger impact here, I'd say. By the way, the metrics are calculated on the stitched predictions, not the crops, and there is not much of a performance gap anymore. [image]

**Qualitative** Below are examples of Fibronectin and ChyTr with per-image normalization and no/few positive cells. The difference in the predictions seems to be minimal. [images] Here are the same samples with dataset-wise normalization. [images]

ngreenwald commented 1 year ago

That's great! It seems to me like having even a couple of positive cells in the image is enough for the normalization to work properly, since the top end will get scaled correctly. What if you take a small crop of an image where there is zero positive signal? I bet it would be easy to find a couple of 256x256 crops that have absolutely no signal from one of CD4, CD45, ECAD, or FOXP3.

If it performs well even in that case, then that definitely simplifies our life!
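In case it helps, a hypothetical helper for finding such crops; the crop size matches the 256x256 mentioned above, and the zero-signal threshold is an assumption:

```python
import numpy as np

def find_empty_crops(channel_img, crop_size=256, max_signal=0.0):
    """Return (row, col) offsets of crops whose summed intensity is <= max_signal."""
    h, w = channel_img.shape
    empty = []
    for r in range(0, h - crop_size + 1, crop_size):
        for c in range(0, w - crop_size + 1, crop_size):
            crop = channel_img[r:r + crop_size, c:c + crop_size]
            if crop.sum() <= max_signal:
                empty.append((r, c))
    return empty
```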

JLrumberger commented 1 year ago

Here are a few samples of markers without positive cells, predicted with per-image normalization. Unfortunately, the images I looked at all had some ECAD+ or CD4+ cells.

CK17 [images]

FOXP3 [images]

ngreenwald commented 1 year ago

Great! FOXP3 is a worst-case scenario marker, so the fact that there are some false positives is okay, since it looks like real staining.

It seems like cohort-wide 99.9th percentile normalization is still the best, but not by much. Maybe we can offer an option for people to compute it on a subset of their data if they want, but if they want to dive right in we can do per-image normalization and not worry that there's much of a performance difference?

I think for now this is definitely promising enough to move forward with what we have?

JLrumberger commented 1 year ago

Yep, I think we can move on. I'll add sampling and multi-processing to the function that calculates the normalization dictionary.
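For context, a rough sketch of how sampling and multi-processing could be wired into the normalization-dict computation; the file layout (one .npz per FOV), the function names, and the averaging of per-FOV percentiles are assumptions rather than the actual Nimbus code:

```python
import numpy as np
from functools import partial
from multiprocessing import Pool

def _fov_quantiles(fov_path, channels, q=99.9):
    """Return {channel: 99.9th percentile} for one FOV stored as an .npz of 2D arrays."""
    with np.load(fov_path) as fov:
        return {ch: float(np.percentile(fov[ch], q)) for ch in channels}

def compute_norm_dict(fov_paths, channels, sample_frac=0.2, n_procs=4, seed=0):
    """Estimate per-channel normalization values from a random subset of FOVs."""
    rng = np.random.default_rng(seed)
    n_sample = max(1, int(len(fov_paths) * sample_frac))
    sampled = list(rng.choice(fov_paths, size=n_sample, replace=False))
    with Pool(n_procs) as pool:
        per_fov = pool.map(partial(_fov_quantiles, channels=channels), sampled)
    # Average the per-FOV percentiles as a cheap proxy for the cohort-wide value.
    return {ch: float(np.mean([d[ch] for d in per_fov])) for ch in channels}
```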