Exploring preprocessing pipelines in Scenario 1

johnarevalo commented 10 months ago

We report the performance of multiple preprocessing steps:

mad: Median absolute deviation normalization
clip: clip outlier values to 500.
drop: drop any column with an outlier value.
imputemedian: impute outliers with median value.
imputeknn: impute outlier values with k-nearest neighbors (kNN).
sphering: whitening transformation estimated with negative controls and applied over the whole dataset.
featselect: feature selection using the "variance_threshold" and "correlation_threshold" operations from pycytominer.
int: rank-based inverse normal transformation.
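
For readers who want to see what the mad, clip, drop and imputemedian steps amount to, here is a minimal NumPy sketch. It is not the code used in this repository. The following are assumptions made for illustration: an outlier is any value whose absolute value exceeds 500 after MAD normalization, MAD statistics are computed per feature over all rows passed in, the 1.4826 consistency factor and the `eps` term are included, and the imputed median is the median of the non-outlier values in the same column. All function names are hypothetical.

```python
import numpy as np

THRESHOLD = 500.0  # assumed cutoff: |value| > 500 counts as an outlier


def mad_normalize(x, eps=1e-6):
    """mad: center each column on its median and scale by its MAD."""
    median = np.median(x, axis=0)
    # 1.4826 makes the MAD comparable to a standard deviation (assumption).
    mad = np.median(np.abs(x - median), axis=0) * 1.4826
    return (x - median) / (mad + eps)


def clip_outliers(x, threshold=THRESHOLD):
    """clip: cap values to the range [-threshold, threshold]."""
    return np.clip(x, -threshold, threshold)


def drop_outlier_columns(x, threshold=THRESHOLD):
    """drop: remove every column that contains at least one outlier."""
    keep = (np.abs(x) <= threshold).all(axis=0)
    return x[:, keep]


def impute_median(x, threshold=THRESHOLD):
    """imputemedian: replace outliers with the median of the column's inliers.

    Values that are already NaN are treated like outliers and imputed too.
    """
    masked = np.where(np.abs(x) > threshold, np.nan, x.astype(float))
    col_median = np.nanmedian(masked, axis=0)
    return np.where(np.isnan(masked), col_median, masked)


# Toy matrix: 3 profiles x 2 features, with one extreme value in column 1.
profiles = np.array([[1.0, 10.0], [2.0, 9000.0], [3.0, 30.0]])
clipped = clip_outliers(profiles)
dropped = drop_outlier_columns(profiles)
imputed = impute_median(profiles)
```

In the pipelines named in this thread the outlier step follows mad, so the threshold would apply to MAD-normalized values rather than raw measurements.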

The best-performing pipelines are mad_drop_int_featselect and mad_int_featselect.
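
Both of these pipelines include the int step. As a rough illustration of what a rank-based inverse normal transformation does, the sketch below ranks each feature, maps ranks to quantiles, and passes them through the standard normal inverse CDF. The Blom offset `c = 3/8` and the ordinal handling of ties are assumptions, not details confirmed in this thread, and `rank_int` is a hypothetical name.

```python
from statistics import NormalDist

import numpy as np


def rank_int(x, c=3.0 / 8.0):
    """Rank-based inverse normal transformation, applied per column."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Ordinal ranks 1..n per column; ties are broken by order of appearance.
    order = x.argsort(axis=0, kind="stable")
    ranks = order.argsort(axis=0, kind="stable") + 1
    # Map ranks to quantiles strictly inside (0, 1).
    quantiles = (ranks - c) / (n - 2.0 * c + 1.0)
    # Standard normal inverse CDF from the standard library.
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(quantiles)


# One feature with three profiles; the output keeps the input ordering.
feature = np.array([[10.0], [30.0], [20.0]])
transformed = rank_int(feature)
```

Because only ranks are used, the result is insensitive to the magnitude of extreme values, which may be part of why the outlier-handling step matters less once int is in the pipeline.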

Performance comparison

![fig1](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/2e4355cd-7a77-4d73-8469-a6b4c29847ca) ![fig2](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/3278ba4f-7f41-43e2-afdd-d09facb38ab4)

We also tried sphering as an additional step to help align values. We searched for the best regularization parameter for the top-2 pipelines described above.
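
To make the role of the regularization parameter concrete, here is a minimal ZCA-style sphering sketch: the transform is estimated on negative-control profiles only and then applied to every profile. This is not the `zca.py` used in the repository. The function names and the way the regularizer is added to the eigenvalues are assumptions for illustration.

```python
import numpy as np


def fit_sphering(controls, reg=1e-2):
    """Estimate a ZCA whitening transform from negative-control profiles."""
    mean = controls.mean(axis=0)
    centered = controls - mean
    cov = centered.T @ centered / (controls.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Guard against tiny negative eigenvalues caused by round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    # reg keeps near-zero-variance directions from being blown up.
    scale = 1.0 / np.sqrt(eigvals + reg)
    whitener = eigvecs @ np.diag(scale) @ eigvecs.T
    return mean, whitener


def apply_sphering(x, mean, whitener):
    """Apply a fitted transform to any set of profiles."""
    return (x - mean) @ whitener


# Synthetic example: correlated "negative controls" and a full dataset.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
controls = rng.normal(size=(200, 3)) @ mixing
dataset = rng.normal(size=(50, 3)) @ mixing

mean, whitener = fit_sphering(controls, reg=1e-9)
sphered_controls = apply_sphering(controls, mean, whitener)
sphered_dataset = apply_sphering(dataset, mean, whitener)
```

A search over the regularization parameter would repeat the fit for several values of `reg` and compare the downstream metrics; larger values leave the data closer to its original scale, smaller values whiten more aggressively.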

Sphering does not increase the performance metrics

Sphering vs other pipelines (including some batch correction methods)

![fig5](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/f1a0e515-7e14-4b1b-ad92-3429174cb36e) ![fig6_map_negcon](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/6405b80f-cb78-4375-a006-51fe6765cc29) ![fig6_map_nonrep](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/469efbd9-4ad3-4a97-90cf-d00164d56afc)

Sphering exploration

![fig3](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/23108c9f-96fe-428e-bb5e-24a71c56a442)

johnarevalo commented 10 months ago

@shntnu Based on the results above, we will try both mad_drop_int_featselect and mad_int_featselect for other scenarios and will report the best. We are not going to include sphering.

shntnu commented 10 months ago

Thank you for documenting all this. Is it reasonable to say that sphering may have been improving results in the past because it was compensating for some of the issues with the data that are now being fixed by these other new preprocessing steps? Of course, it's hard to test that specifically, but that's the only hypothesis I could come up with.

Also, can you link to the sphering code you used?

johnarevalo commented 10 months ago

> Is it reasonable to say that sphering may have been improving results in the past because it was compensating for some of the issues with the data that are now being fixed by these other new preprocessing steps?

Yes, I also think that's a reasonable explanation.

I'm using a copy from your pycytominer PR.

https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/blob/aac9bfd2854984cf0309fcb6786abb0358ec0e7f/zca.py#L16

shntnu commented 10 months ago

> Yes, I also think that's a reasonable explanation.
>
> I'm using a copy from your pycytominer PR.

Sounds good

For our notes, we decided it's better to use a copy because we were worried about package conflicts if we updated pycytominer. We could have used a separate conda env for each rule, but this (copying) is simpler (for now :D)

carpenter-singh-lab / 2023_Arevalo_NatComm_BatchCorrection

Exploring preprocessing pipelines in Scenario 1 #4