cytomining / pycytominer

Python package for processing image-based profiling data
https://pycytominer.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Determine epsilon from the data #116

Open niranjchandrasekaran opened 3 years ago

niranjchandrasekaran commented 3 years ago

The current implementation of sphering in pycytominer uses a constant value (1e-6) for the regularization parameter epsilon.

https://github.com/cytomining/pycytominer/blob/a5dac9e3fa3cdf61e9607f479ba53eac7fed18b1/pycytominer/operations/transform.py#L25
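
For context, here is a minimal sketch of where epsilon enters a ZCA-style sphering transform. This is not the pycytominer code linked above; the function name `zca_whiten` and its return values are assumptions made for illustration. Epsilon is added to each covariance eigenvalue before the inverse square root, so it bounds how much low-variance directions get amplified.

```python
import numpy as np


def zca_whiten(X, epsilon=1e-6):
    """ZCA-whiten the rows of X (hypothetical helper, not pycytominer's API).

    Returns the whitened data and the covariance eigenvalues sorted from
    largest to smallest.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Round-off can make tiny eigenvalues slightly negative; clip them to zero
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    # epsilon keeps 1 / sqrt(eigenvalue) bounded for near-zero eigenvalues
    scale = 1.0 / np.sqrt(eigenvalues + epsilon)
    W = (eigenvectors * scale) @ eigenvectors.T
    return X_centered @ W, eigenvalues[::-1]
```

With a very small epsilon the output covariance is close to the identity, while a larger epsilon leaves low-variance directions shrunk rather than inflated. That trade-off is why the choice of epsilon matters.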

Sphering performance may improve if the value of epsilon is determined directly from the data using @shntnu's approach, where epsilon is one-tenth of the eigenvalue at the knee of the eigenvalue curve.

I wrote a crude implementation of this approach using the kneed package. We may want to rewrite it and add it to the sphering method in pycytominer.

gwaybio commented 3 years ago

I'm revisiting this now, since I'm adding epsilon to normalize.py in #132

@niranjchandrasekaran - I think this enhancement is cool, but it is beyond the scope of #132. Once we merge #132, we can tackle this if it becomes necessary.

My overall strategy would be to add a new file - something like pycytominer.cyto_utils.normalize_utils.py where you would write the function estimate_epsilon_regularization().

We can then enable the option normalize(spherize_epsilon="auto") in the normalize function. I am not sure whether we want spherize_epsilon to default to "auto".
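
One way the "auto" option could be resolved internally is sketched below. This is only an illustration of the idea: `resolve_spherize_epsilon` and its `estimator` argument are hypothetical names, and nothing here exists in pycytominer today. A numeric value passes through unchanged, and the string "auto" defers to whatever estimation function is supplied.

```python
def resolve_spherize_epsilon(spherize_epsilon, eigenvalues, estimator):
    """Turn a user-supplied epsilon into a float (hypothetical helper).

    spherize_epsilon: a non-negative number, or the string "auto".
    eigenvalues: covariance eigenvalues, only used when "auto" is requested.
    estimator: callable mapping eigenvalues to an epsilon value.
    """
    if isinstance(spherize_epsilon, str):
        if spherize_epsilon != "auto":
            raise ValueError(
                f"spherize_epsilon must be a number or 'auto', got {spherize_epsilon!r}"
            )
        return float(estimator(eigenvalues))

    epsilon = float(spherize_epsilon)
    if epsilon < 0:
        raise ValueError("spherize_epsilon must be non-negative")
    return epsilon
```

Keeping the numeric path untouched means the current constant could stay the default, and "auto" would be strictly opt-in.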

The only other thing I would say is that we should do our best to avoid any additional dependencies. We were burned in the past by deprecated packages (for example, cytomining/cytominer-database#108), and we shouldn't introduce dependencies that we would only ever use on rare occasions. If this comes up again as important, would it be possible to avoid using kneed?

niranjchandrasekaran commented 3 years ago

We could implement a simple version of kneed ourselves that only needs to work with eigenvalue curves. I haven't read the paper, though, so I am not sure how easy that would be: https://raghavan.usc.edu//papers/kneedle-simplex11.pdf
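
As a rough, dependency-free starting point, something like the sketch below might be enough for eigenvalue curves. It is not the Kneedle algorithm from the paper. It is a simplified heuristic in the same spirit: rescale the sorted eigenvalues to the unit square and take the point that lies furthest below the straight line joining the first and last points. The name `find_knee_index` and the `fraction` parameter are assumptions; `estimate_epsilon_regularization` is the name proposed earlier in this thread, and the one-tenth factor follows the approach described in the first comment.

```python
def find_knee_index(eigenvalues):
    """Index of the knee of a decreasing eigenvalue curve (simplified heuristic)."""
    values = sorted(eigenvalues, reverse=True)
    n = len(values)
    if n < 3:
        raise ValueError("need at least three eigenvalues to locate a knee")
    span = values[0] - values[-1]
    if span == 0:
        raise ValueError("a flat eigenvalue curve has no knee")

    best_index, best_gap = 0, 0.0
    for i, value in enumerate(values):
        x = i / (n - 1)                  # position rescaled to [0, 1]
        y = (value - values[-1]) / span  # height rescaled to [0, 1]
        gap = (1.0 - x) - y              # how far the curve sits below the chord
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


def estimate_epsilon_regularization(eigenvalues, fraction=0.1):
    """Epsilon as a fraction (default one-tenth) of the eigenvalue at the knee."""
    values = sorted(eigenvalues, reverse=True)
    return fraction * values[find_knee_index(values)]
```

Unlike kneed, this does no smoothing and returns only a single global knee, so it would need checking against real eigenvalue curves before we rely on it.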

shntnu commented 3 years ago

@gwaygenomics LMK if my pondering this will help unblock the profiling comparison paper. I'll move it out of my inbox, but ping me if it becomes relevant.