Clarification on Normalization function

rjesud commented 2 years ago

Firstly, thanks for releasing this tool. It has been a great resource for our projects.

I am looking for clarification on the normalization function - namely, when is it appropriate to use/skip the "samples" argument and how exactly is the normalization method using this flagged data. Is this only necessary when combining data from multiple plates (each with control/dmso wells)?

Secondly, is it correct to say that the 'spherize' method requires the 'samples' argument? And should it be executed AFTER a first round of normalization and feature selection?

Thank you!

gwaybio commented 2 years ago

Glad to hear you've found our tool useful @rjesud !

samples is a flexible argument that must be given as a query string in the format accepted by pandas.DataFrame.query(). All it does is learn a specific transform (e.g. zscore) from only the samples matched by the query and then apply that transform to the full data. So, for example, you might want to use the argument if, as you say, you are combining data from multiple plates that contain the same controls but different treatments.
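
To make the "learn on a subset, apply to everything" idea concrete, here is a minimal sketch in plain Python. It is not pycytominer's implementation: the well records, the `area` feature, and the `zscore_against` helper are all hypothetical, and normalizing each plate separately is an assumption of this example. The lambda plays the role that the query string plays for the samples argument.

```python
from statistics import mean, pstdev

# Hypothetical toy wells: two plates share a DMSO control but carry different treatments.
wells = [
    {"plate": "A", "treatment": "DMSO", "area": 10.0},
    {"plate": "A", "treatment": "DMSO", "area": 12.0},
    {"plate": "A", "treatment": "drug_1", "area": 17.0},
    {"plate": "B", "treatment": "DMSO", "area": 20.0},
    {"plate": "B", "treatment": "DMSO", "area": 24.0},
    {"plate": "B", "treatment": "drug_2", "area": 34.0},
]


def zscore_against(rows, feature, is_reference):
    """Learn mean/std from the reference rows only, then transform every row."""
    reference = [r[feature] for r in rows if is_reference(r)]
    center, scale = mean(reference), pstdev(reference)
    return [{**r, feature: (r[feature] - center) / scale} for r in rows]


# Assumption: each plate is normalized on its own, against its own controls.
normalized = []
for plate in ("A", "B"):
    plate_rows = [w for w in wells if w["plate"] == plate]
    normalized += zscore_against(
        plate_rows, "area", lambda r: r["treatment"] == "DMSO"
    )

for row in normalized:
    print(row["plate"], row["treatment"], row["area"])
```

In this toy data the raw values on plate B are twice those on plate A, yet after scaling against each plate's own controls the two treated wells land on the same value, which is the reason to point the transform at shared controls when combining plates.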

spherize doesn't "require" the samples argument - it should work perfectly fine with the default. However, in practice, you probably should use the samples argument. You can find a direct example of our pipeline here: https://github.com/broadinstitute/lincs-cell-painting/. You'll probably derive the most benefit from the profiles and spherized_profiles folders. You'll notice that we did apply spherize after the first round of normalization and feature selection (see here), but this choice is likely data-specific and should be made on a case-by-case basis.

I am also generally very interested in how you're using the package. We are thinking about writing this paper up, and it might be nice to discuss potential use cases. Additionally, it would be great to highlight other community examples by linking to them somewhere in this repo!

rjesud commented 2 years ago

Thank you, @gwaygenomics!

This is helpful.

I am interested in doing single-cell phenotyping of the cells that have been "cell painted" and exposed to some perturbation. I would like to use this package's normalization and feature selection functions to preprocess my single-cell data before phenotyping by some clustering algorithm (TBD-maybe phenograph).

So, is it safe to say that I should skip the aggregation and consensus functions? I assume they would move me away from single-cell data as my clustering input.

gwaybio commented 2 years ago

gotcha!

Yes, if you are analyzing single cells, skip aggregation and consensus.

We have not used pycytominer extensively for single-cell applications, so we would love to have your insights as you use the package. If you have any comments about usage or any feature requests, please open new issues. As pycytominer is an open source package, we also welcome direct contributions.

rjesud commented 2 years ago

Great. Thanks again for your work on this tool and I will be sure to stay connected with the effort as our project develops.

r

cytomining / pycytominer

Clarification on Normalization function #177