Closed johnarevalo closed 8 months ago
@shntnu Based on the results above, we will try both `mad_drop_int_featselect` and `mad_int_featselect` for other scenarios and will report the best. We are not going to include sphering.
Thank you for documenting all this. Is it reasonable to say that sphering may have been improving results in the past because it was compensating for some of the issues with the data that are now being fixed by these other new preprocessing steps? Of course, it's hard to test that specifically, but that's the only hypothesis I could come up with.
Also, can you link to the sphering code you used?
> Is it reasonable to say that sphering may have been improving results in the past because it was compensating for some of the issues with the data that are now being fixed by these other new preprocessing steps?

Yes, I also think that's a reasonable explanation.
I'm using a copy from your pycytominer PR.
> Yes, I also think that's a reasonable explanation.
>
> I'm using a copy from your pycytominer PR.

Sounds good
For our notes, we decided it's better to use a copy because we were worried about package conflicts if we updated `pycytominer`. We could have used a separate conda env for each rule, but this (copying) is simpler (for now :D)
We report the performance of multiple preprocessing steps:

- `mad`: median absolute deviation normalization.
- `clip`: clip outlier values to 500.
- `drop`: drop any column with an outlier value.
- `imputemedian`: impute outliers with the median value.
- `imputeknn`: impute outlier values with kNN.
- `sphering`: whitening transformation estimated with negative controls and applied over the whole dataset.
- `featselect`: feature selection process using the `"variance_threshold"`, `"correlation_threshold"` operations from pycytominer.
- `int`: rank-based inverse normal transformation.

The best performing pipelines are `mad_drop_int_featselect` and `mad_int_featselect`.

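
For readers who want a concrete picture of the `mad`, `drop` and `int` steps, here is a minimal sketch in plain Python. It is not the code used in this repo. The function names are hypothetical, the step order is inferred from the pipeline name, the 1.4826 MAD scaling and the Blom offset (`c = 3/8`) are common conventions assumed here, and the 500 threshold is assumed to apply to absolute values after normalization. `featselect` is left out.

```python
from statistics import NormalDist, median

# Taken from the `clip` description; assumed here to apply to absolute values.
OUTLIER_THRESHOLD = 500.0


def mad_normalize(column, eps=1e-18):
    """Center on the median and scale by 1.4826 * MAD (assumed convention)."""
    med = median(column)
    mad = median([abs(x - med) for x in column])
    return [(x - med) / (1.4826 * mad + eps) for x in column]


def drop_outlier_columns(table, threshold=OUTLIER_THRESHOLD):
    """Keep only the columns whose absolute values never exceed the threshold."""
    return {
        name: col
        for name, col in table.items()
        if all(abs(x) <= threshold for x in col)
    }


def rank_int(column, c=3.0 / 8.0):
    """Rank-based inverse normal transform; tied values share their average rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < n and column[order[j + 1]] == column[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


def mad_drop_int(table):
    """Hypothetical mad -> drop -> int chain over a {feature: values} table."""
    normalized = {name: mad_normalize(col) for name, col in table.items()}
    kept = drop_outlier_columns(normalized)
    return {name: rank_int(col) for name, col in kept.items()}


table = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [0.0, 0.0, 0.0, 0.0, 1e6],  # one extreme value, MAD of zero
}
result = mad_drop_int(table)
```

In this toy table, `f2` has a single extreme value that survives normalization, so `drop` removes the whole column, while `f1` is mapped onto normal quantiles by `int`.
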
Performance comparison
![fig1](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/2e4355cd-7a77-4d73-8469-a6b4c29847ca) ![fig2](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/3278ba4f-7f41-43e2-afdd-d09facb38ab4)

We also tried sphering as an additional step to help align values. We searched for the best regularization parameter for the top-2 pipelines described above.
Sphering does not increase the performance metrics
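
As a rough illustration of what the sphering step and its regularization parameter do, here is a sketch of regularized ZCA whitening fit on negative controls and applied to all profiles. It is an assumption-laden example, not the copy taken from the pycytominer PR: the function names, the use of ZCA, and the way `reg` enters the transform are all hypothetical, and the data is synthetic.

```python
import numpy as np


def fit_sphering(negcon, reg=1e-2):
    """Estimate a ZCA whitening transform from negative-control profiles.

    `reg` is added to the covariance eigenvalues so that low-variance
    directions are not amplified excessively (assumed form of regularization).
    """
    mean = negcon.mean(axis=0)
    cov = np.cov(negcon - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    weights = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + reg)) @ eigvecs.T
    return mean, weights


def apply_sphering(profiles, mean, weights):
    """Apply the transform estimated on negative controls to any profiles."""
    return (profiles - mean) @ weights


# Synthetic stand-in for negative-control wells with correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
negcon = rng.normal(size=(500, 3)) @ mixing
all_profiles = rng.normal(size=(50, 3)) @ mixing + 1.0

mean, weights = fit_sphering(negcon, reg=1e-8)
sphered_negcon = apply_sphering(negcon, mean, weights)
sphered_all = apply_sphering(all_profiles, mean, weights)
```

A search like the one described above would repeat `fit_sphering` over a grid of `reg` values and score each result with the evaluation metrics; that scoring is not shown here.
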
Sphering vs other pipelines (including some batch correction methods)
![fig5](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/f1a0e515-7e14-4b1b-ad92-3429174cb36e) ![fig6_map_negcon](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/6405b80f-cb78-4375-a006-51fe6765cc29) ![fig6_map_nonrep](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/469efbd9-4ad3-4a97-90cf-d00164d56afc)

Sphering exploration
![fig3](https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection/assets/1301626/23108c9f-96fe-428e-bb5e-24a71c56a442)