FeatureRequest: Automated identification and removal of non-biologically variant features

cytomining / pycytominer

Python package for processing image-based profiling data

https://pycytominer.readthedocs.io

BSD 3-Clause "New" or "Revised" License

80 stars 36 forks source link

FeatureRequest: Automated identification and removal of non-biologically variant features #452

Open hwarden162 opened 1 month ago

hwarden162 commented 1 month ago

Feature type

[X] Add new functionality
[ ] Change existing functionality

General description of the proposed functionality

There are features measured during morphological profiling that are dependent on the positioning or rotation of the microscope. Simple examples of this are centroids and orientation measurements. Other examples would include measurements on bounding boxes, the image below shows how the bounding box area of a cell changes under rotation of the microscope.

Taking CellProfiler as an example, there are multiples of these measurements. When used for machine learning or statistical analysis they introduce technical noise and can contribute to batch effect and data leakage.

Feature example

I have a trial solution of this that requires the user to specify what software was used to generate their measurements and then iterates over feature names matching the patterns of variant features that have been identified manually. My solution extends feature_select like this

from pycytominer import feature_select

non_variant = feature_select(normalized_df, operation="drop_non_bio_variant", drop_non_bio_variant_data_source="cellprofiler")

Alternative Solutions

No response

Additional information

No response

hwarden162 commented 1 month ago

Example of rotationally variant feature. We have manually identified many translationally and rotationally variant features in the output of CellProfiler.

d33bs commented 1 month ago

Thank you @hwarden162 for opening this issue and for the description of the enhancement! Looking into this further, it seems like this capability might relate to a parameter called blocklist_file within pycyotminer.feature_select() which provides the option to use a default or custom list of features to be excluded from analyses. The default blocklist feature names may be found here: https://github.com/cytomining/pycytominer/blob/main/pycytominer/data/blocklist_features.txt . Does the existing list of feature names miss anything? Maybe it'd be useful to update the list based on what you've found.

Additionally, based on your example rotationally variant feature and overall experience with these types of features, would it be possible to make this kind of calculation for any feature - perhaps enabling an automatic method of uncovering features which should be excluded?

hwarden162 commented 1 month ago

There's no way of calculating this on a dataset, but you can via experimentation. I have some python code that generates "synthetic" cells (just 2 channel jpg's) of an ellipse and a circle and then rotates and translates these and saves them as files. I then use CellProfiler to analyse these and plot the feature against rotation and translation. However, this will then miss some features like the texture module which is also rotationally variant but only up to ordering of the directional features. In this case these can be transformed to be rotationally invariant by taking the mean, std, min and max which contain all the information in a rotationally invariant manner. There are also other features that are rotationally variant but can be transformed to be rotationally invariant (like centre of mass for stains is translationally variant but can be converted to being invariant by converting it to be the distance of the center of mass of the stain from the center of mass of the object mask).

I can update the blocklist with the rotationally variant features I've identified, but the question also then becomes do you blocklist these variant features that can be made invariant via transformation, Furthermore, would it be possible to add some form of feature extraction technique that performs these transformations?

hwarden162 commented 1 month ago

An example of the texture features being rotationally variant:

These features are rotationally variant but information can be salvaged as they are phased and essentially trade off with one another. Rather than blocklisting these features it may be useful to average across directions (or across orthogonal directions) as this demonstrates more stable results

d33bs commented 1 month ago

Thanks for those additional thoughts @hwarden162 ! Based on what you mentioned it made me think about pycytominer.operations.variance_threshold() (which available for use via pycytominer.feature_select()). I don't know that it'd meet the specific needs you describe however. If none of the existing operations would meet the needs maybe it could be a good space for a new pycytominer.operations module and functions.

@gwaybio would you have any thoughts on what @hwarden162 brings up?

axiomcura commented 1 month ago

This is a cool idea!

However, this is a challenging problem, as defining biologically irrelevant features can vary greatly depending on the experimental context. I believe this decision should ultimately be tailored to the specific goals of the experiment. The block list approach seems advantageous as it allows users to specify which features are not relevant in their particular case. Additionally, it would be beneficial to implement a feature that allows users to dynamically expand the block list programmatically (e.g., by loading the list as an object, enabling users to add more features as needed). What are your thoughts on this, @d33bs? Or is there already a feature that supports this functionality?

Taking the code example above, I pictured it something like this:

from pycytominer import feature_select, load_blocklist

# this returns a "BlockList" object containing all the block list features
loaded_blocklist = load_blocklist()

# adding more features
loaded_blocklist.add(["new_block_feature1", "new_block_feature2", "new_block_feature3"])

# here we use the "blocklist" operation with our newly updated block list
non_variant = feature_select(
    normalized_df,
    operation= "blocklist",
    blocklist_features=loaded_blocklist,
    drop_non_bio_variant_data_source="cellprofiler",
)

By default, blocklist_features will be set to None. If the user selects blocklist as an operation, it will automatically use the predefined blocklist specified in the blocklist file.

I believe this is a more careful approach, as it allows the user to define which features should be removed from the dataset and provides justification for those decisions. Relying on a mathematical model to determine what is considered 'biologically irrelevant' may accidentally lead to the removal of features that are actually biologically significant.

hwarden162 commented 1 month ago

I agree that defining biologically irrelevant features is a very dependent task. but the suggestion would be not to remove features that are biologically irrelevant but to remove features that are specifically dependent on factors that aren't biological. To take the bounding box area as an example, as shown above this feature can vary based on the rotation of the microscope. The information contained is this measurement is stil preserved in the area feature.

For me this makes it a no brainer to throw out the bounding box area feature (in the case that it is computed orthogannaly with respect to the image axes) as the information is still contained in other measurements. It does become a different conversation alltogether when we move to other features that have no non-variant analog in the morphological profile (like the texture/haralick features).

gwaybio commented 1 month ago

Thanks for opening this issue @hwarden162 ! My apologies for my delayed reply - I'm just now digging out after meeting several deadlines.

I appreciate this discussion @d33bs and @axiomcura - I see exactly what you're saying and my thoughts align.

My summary

The fastest way to implement Hugh's "drop_non_bio_variant" operation would be through the existing blocklist operation using a custom blocklist_file
For us to add a "drop_non_bio_variant" in the future, we would need convincing evidence that these features should be removed (at least in certain scenarios). Hugh, it looks like you've started this important work, but we would need more of a systematic investigation, and (likely) a publication.
If there's interest, Hugh, we could pursue this investigation jointly. We would have to define roles, and I imagine that you will take the lead and lions share of the work. It looks like you have quite a bit of this already, so do you envision us contributing anything specific? How can we help?

Followup questions for Hugh

However, I have a few followup questions for @hwarden162 , which may inform how we proceed.

Questions related to simulation software

You write:

I have some python code that generates "synthetic" cells (just 2 channel jpg's) of an ellipse and a circle and then rotates and translates these and saves them as files.

Is this code available somewhere? Is the code written in such a way that we can vary intensities as well? How extensible/modular is the code?

You also write:

However, this will then miss some features like the texture module

Do you mean that the current synthetic cell simulation does not have the ability to identify rotationally invariant texture features? If so, then how are you identifying them in the plots you show?

Miscellaneous questions

There are also other features that are rotationally variant but can be transformed to be rotationally invariant (like centre of mass for stains is translationally variant but can be converted to being invariant by converting it to be the distance of the center of mass of the stain from the center of mass of the object mask).

This is interesting! So that I make sure I understand, you are suggesting that there are secondary operations that we can perform after CellProfiler measurements that are more informative? Are there any other examples? You write "average across directions (or across orthogonal directions) as this demonstrates more stable results", but I am not sure what you mean.

Furthermore, would it be possible to add some form of feature extraction technique that performs these transformations?

Yes! I think this is a very interesting idea. It sounds like, after we confirm through a systematic evaluation, that we could add a new transformation. Do you have any sort of eval in mind?

hwarden162 commented 1 month ago

(1) The code is back from when I was new to Python so it's messy but I've uploaded it here: https://github.com/hwarden162/synthetic-cells-proto. It models cells as various simple Polygons and then applies different variation strategies for staining and noise profiles, the code given can already vary staining patterns in various simple manners and should be easy to generate other intensity patterns. It's near the top of my ToDo list to make it into a proper package but the bones of it are there.

(2) Yes, the basic synthetic cell experiment doesn't capture rotational variance of texture features. However, the code I have shared has the ability to capture this by applying a spatially correlated noise map over the top. So it is possible to to represent this (message me if you want to replicate this with the code and can't figure out how) and I have metrics for rotational variance of texture features.

(3) It's been a long time since I looked at them. but yes there are transformations that can either normalise or transform variantt features to be non-variant. One example is rather than keeping the center of mass for an object mask and the center of mass for a stain in the object, is to measure the distance between these center of masses then making it independent from x/y coordinates of the image. Another is to take the mean/std/min/max of the haralick feature directions which makes it less sensitive to rotations. Another is for a feature like orientation, you can normalise orientation using a Von-Mises distribution and then calculate the deviance from the mean to make this independent from the orientation of the camera.

I can perfrom a lot of these analyses on in silico cells and demonstrate rotational and translational variance of features. I assume it wouldn't be too hard to quantify reduction in noise between a data set containing variant features and a different data set containing non-variant features only. The difficulty then comes with motivating the need for extra processing. I guess this could be done by creating in silico pertubations of the synthetic cells and training classifiers on the processed and non-processed data sets but what you would probably really need to cell it would be examples on real world data which is where I think the problems would come up as to how much work it would take to find an example which creates a clear difference. Showing the presence of non-biological noise is pretty easy it is then quantification and demonstration of this problem on real datasets which is where I honestly have no idea what the results would be.

hwarden162 commented 1 month ago

To add to this, here is an example of a "synthetic cytoplasm" that I generate for this.

I fill the shape with anistropic spatially correlated noise that can be detected by the texture modules.

hwarden162 commented 1 month ago

I then rotate and manipulate this high res image as an analog for real world microscope adjustments and then downsize the image to simulate taking a photo.

It's not perfect but allow me to accurately represent real world non-biological affects on morphological measurements as all of the underlying cells are the same.

hwarden162 commented 1 month ago

I've done a bit of follow up work on this to start looking at formally demonstrating the rotational/translational dependency of morphological features and how processing could help us overcome this.

To start with I have generated sets of images with different textural patterns. These are shown below, the top left has small scale spatial correlation and the bottom right has large scale spatial correlation. The top right has small scale spatial correlation along the minor axis of the ellipse and large scale spatial correlation along the major axis and the bottom right image is the opposite.

Screenshot 2024-10-23 at 12 47 11

For each of these groups I generated 2,000 high res images all of which are aligned with the major axis of the ellipse along the horizontal axis. I then downsampled these image to 80x80 pixels and saved them. After this I took the 8,000 original high re images and rotated each one a random amount. I then downsampled these rotated images to 80x80 pixels to generate a paired data set of aligned and randomly rotated images.

I morphologically profiled the texture features of these images using CellProfiler. I then trained random forest models to classify the textural staining pattern based on the texture features. I trained one model on features from the aligned images and one on the features from the rotated images, the training and testing sets for each model were paired so that the same aligned/rotated images were in the same sets.

I then tested each model on both the aligned and rotated testing sets. I plotted how the accuracy of the model fluctuated with the rotation of each image (rounded to the nearest 5 degrees)

As you can see this shows that the model trained on the aligned data struggles with rotations the closer they are to being orthogonal to the original image. This is because the texture features output by CellProfiler are given with respect to different directions (vertical, horizontal and both diagonals). So when an image is at 90 degrees the texture features are the same but appearing in different columns. This makes it difficult for the aligned model to classify as for it the major and minor axes of the cells are always in the same axes as the x and y axes.

As each texture feature is given for 4 directions, I then took the 4 values for each feature and took the min/max/mean/std which are linearly independent and so contain all of the information in the system without encoding directionality. I then trained another random forest on the same data that has been passed through this processing and then compared the results for each rotation again. The model was trained on aligned data and tested on rotated data.

Here you can see that this processing means that the model is able to generalise from the aligned case to the rotated case. This suggests that the processing of the data might have removed some of the rotational variance in the data. At the very least it suggests this may be a useful data processing step for using texture features in data science pipelines.

gwaybio commented 1 month ago

This is great Hugh! I'm wondering a couple things at this stage:

What is the title of our paper?
What are the primary contributions of the paper? I see at least two: 1) A list of CellProfiler features that are rotationally variant. 2) A description of mathematical transformations to overcome this issue. Are there others? Have you considered applying this analysis to DeepProfiler or scDino features?
We need to provide convincing evidence in support of these contributions. I believe the work you have done already is convincing. As we discussed previously, even stronger support would be to do this with real-world data. Perhaps the path forward is to A) finalize the simulated data experiments, B) finalize the transformations, C) add them as a feature branch to pycytominer, D) plan the real-world experiment.
Ideas for planning real-world experiment @jenna-tomkinson and @roshankern recently accepted work in which we predict single-cell phenotypes (see preprint). We ran into an issue (see figure 4) where phenotype prediction models performed poorly in held out images. Essentially, we could 1) remove invariant features and 2) transform invariant features and reapply our leave-one-image-out analysis. This would directly test the hypothesis that rotational variance of CellProfiler features impacts performance of single-cell phenotype prediction.

hwarden162 commented 4 weeks ago

(1) Short snappy titles are not my strong suit but I have been giving this some thought. I was thinking along the lines of "Data processing techniques for translationally and rotationally variant features in morphological profiling".

(2) I agree. The main contributions that I see are categorising the outputs of morphological features (so far just CellProfiler) into 3 groups (1) morphological features that are translationally/rotationally variant and should be removed from data science pipelines (2) morphological features that are translationally/rotationally variant but that can be processed such that they are no longer variant (or at least not as variant) and a description of this process and (3) features that are not translationally/rotationally variant.

I personally have never used DeepProfiler or scDino. All of the features I have identified as variant I have actually just done by looking up the documentation and manually going through every feature and checking how they are measured. This doesn't take long for most features. Furthermore, this can then be backed up by generating synthetic cells, rotating them and passing them through and then plotting features (or some form of feature distance metric) against rotation (or whatever variance you are measuring).

(3) I agree with this. The biggest bonus of using rotationally invariant features is when analysing samples that have a different rotation to the training set. The area which I think is more likely to be impactful in real world applications is the translationally variant centre of mass features and the like (as well as the removal of features which are just noise).

OverallAccuraciesSC

As you can see here the accuracy of predictions made on randomly rotated images when trained on randomly rotated images isn't really affected by this variant feature processing. This is the trouble I have been having with this and why it has taken me so long to do anything with it because in many cases the model output accuracy may not be affected but the model itself is more methodologically sound. (This is important to me as a theoretically minded mathematician but may not have much impact on a results oriented person). I think this makes it a hard sell to people unless there is an easy fix for it (like adding a couple of lines of pycytominer code to their preexisting pipeline).

(4) Obviously this is more for your team than mine. Best case scenario is that this extra processing would improve the outcome. If this didn't happen then there is also the possibility if the pipeline is easy to rerun of filtering your features to use a smaller subset of features that contain more spatially variant features (e.g. removing intensity distribution module features) and rerunning on a smaller subset to force the model to train on these problematic features and then showing (hopefully!) an improvement when running the model again on these features once they have been processed.

gwaybio commented 3 weeks ago

This is great @hwarden162 - I am going to send you an email to discuss manuscript next steps.

In the meantime, would you mind filing a PR into this repo? It would be great to actually contribute two things: 1) A new transformation for rotationally variant features and 2) a rotaional_variance_blocklist file.

It would also be be great for you to refine the simulated cell repository.

hwarden162 commented 3 weeks ago

I'm making a new repo with the cell simulations and the experimental code, I'll link it here but I've been busy with some other projects. Hopefully will be up this week.

As an addition to what I've said above, I repeated the experiment but grouping the cells into groups of 5 which I then called a well. I then took the median of each feature and trained on the aggregated morphologies.

Here are the single cell results:

OverallAccuraciesSC

Here are the well aggregated results

OverallAccuraciesPL

This shows that this processing is even more important on aggregated data and suggests we will probably find it easier to replicate the results on aggregated data too.

Will update below when Ive updated the repos.

hwarden162 commented 3 weeks ago

Updated package for generating images: https://github.com/hwarden162/cellgenerator