USCbiostats / partition

A fast and flexible framework for data reduction in R
https://uscbiostats.github.io/partition/
Other
34 stars 5 forks source link

Improve documentation for `part_pc1()` and allow `reduce_first_component()` to operate independently #40

Open jtian123 opened 1 month ago

jtian123 commented 1 month ago

I am currently working with the part_icc() partitioner, which employs the direct-measure-reduce approach as follows:

Direct: direct_distance(), Minimum Distance Measure: measure_icc(), Intraclass Correlation Reduce: reduce_scaled_mean(), Scaled Row Means My intention was to retain the "direct" and "measure" stages unchanged while switching the "reduce" stage to reduce_first_component(). However, I encountered an issue where reduce_first_component() appears to be almost an "empty" function, with the primary functionality being handled within measure_variance_explained().

This is contrasted with the structure of another partitioner, part_pc1(), which also uses the direct-measure-reduce approach:

Direct: direct_distance(), Minimum Distance Measure: measure_variance_explained(), Variance Explained (PCA) Reduce: reduce_first_component(), First Principal Component It seems that in the part_pc1() partitioner, measure_variance_explained() is effectively performing the reduction task, making reduce_first_component() redundant and potentially misleading for users who intend to utilize it as a standalone reducer.

I propose separating the reduction functionality from measure_variance_explained() and enhancing reduce_first_component() to function as a true reducer. This modification would clarify the roles of these functions and facilitate easier customization for users.

Could this enhancement be considered for implementation to improve clarity and functionality?

malcolmbarrett commented 1 month ago

Yes, I see what you're saying here. But do you think it's possible to separate without fitting the PCA twice? I imagine that could add a lot of computation time on large datasets. I'm not sure we can generate the variance explained without fitting the PCA, but I may be mistaken.

I'm traveling right now, but I'm happy to look at a pull request if you're interested in fleshing this out. Otherwise, I can get to it when I'm back.

millstei commented 1 month ago

To estimate variance explained, we would need to compute PCA, because the variance explained relates to the reduced feature.

Josh


From: Malcolm Barrett @.> Sent: Wednesday, August 7, 2024 7:19 AM To: USCbiostats/partition @.> Cc: Subscribed @.***> Subject: Re: [USCbiostats/partition] Can't customize part_icc with new reducer: reduce_first_component (Issue #40)

Yes, I see what you're saying here. But do you think it's possible to separate without fitting the PCA twice? I imagine that could add a lot of computation time on large datasets. I'm not sure we can generate the variance explained without fitting the PCA, but I may be mistaken.

I'm traveling right now, but I'm happy to look at a pull request if you're interested in fleshing this out. Otherwise, I can get to it when I'm back.

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/issues/40*issuecomment-2273592506__;Iw!!LIr3w8kk_Xxm!pXn4wFi-Af1T7ZjwMMA1z8G5idVmtEoFIjsr4egosWfNG5JXUBPpZP4Y_3vyJ51x_-BNx_fqtEEkTQUirsbh2XTMD0GPJQ$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AFQJR4DVE2MCBWDQMM3TUZLZQIUGPAVCNFSM6AAAAABLTUS2B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTGU4TENJQGY__;!!LIr3w8kk_Xxm!pXn4wFi-Af1T7ZjwMMA1z8G5idVmtEoFIjsr4egosWfNG5JXUBPpZP4Y_3vyJ51x_-BNx_fqtEEkTQUirsbh2XQC9wTWeg$. You are receiving this because you are subscribed to this thread.Message ID: @.***>

malcolmbarrett commented 1 month ago

Yes, this is all coming back to me now. I don't think we can avoid that detail, but I think we can address the issue.

Because the two can't be separated, when used together, measure_variance_explained() stores the first component for (see these lines) for reduce_first_component() to use later without needing to recompute it. So I think the principle is right and it just uses a computationally efficient approach.

Your example arguably shows a bug in reduce_first_component(). Just as measure_variance_explained() only stores the first component if that's the reducer, reduce_first_component() should compute the PCA if it hasn't already been computed by measure_variance_explained().

So I think there are two areas of improvement here: 1) better documenting this relationship between measure_variance_explained() and reduce_first_component() and 2) allowing reduce_first_component() to operate independently of measure_variance_explained() by computing the PCA itself when needed (and only when needed)

jtian123 commented 1 month ago

Yes, the two important areas of improvement you summarized are exactly what I want to say in the issue report.

Thanks, Jinyao

Get Outlook for iOShttps://aka.ms/o0ukef


From: Malcolm Barrett @.> Sent: Wednesday, August 7, 2024 10:09:24 AM To: USCbiostats/partition @.> Cc: Jinyao Tian @.>; Author @.> Subject: Re: [USCbiostats/partition] Can't customize part_icc with new reducer: reduce_first_component (Issue #40)

Yes, this is all coming back to me now. I don't think we can avoid that detail, but I think we can address the issue.

Because the two can't be separated, when used together, measure_variance_explained() stores the first component for (see these lineshttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/blob/5f7366a9e73ec76ec775276b0dffe4361ed88f68/R/metrics.R*L168-L176__;Iw!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_uu8Ox-np$) for reduce_first_component() to use later without needing to recompute it. So I think the principle is right and it just uses a computationally efficient approach.

Your example arguably shows a bug in reduce_first_component(). Just as measure_variance_explained() only stores the first component if that's the reducer, reduce_first_component() should compute the PCA if it hasn't already been computed by measure_variance_explained().

So I think there are two areas of improvement here: 1) better documenting this relationship between measure_variance_explained() and reduce_first_component() and 2) allowing reduce_first_component() to operate independently of measure_variance_explained() but computing the PCA itself when needed (and only when needed)

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/issues/40*issuecomment-2273934736__;Iw!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_uu1q6lmd$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AS25FF6KWXMUI752BEJWXBTZQJIEJAVCNFSM6AAAAABLTUS2B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTHEZTINZTGY__;!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_utJMl013$. You are receiving this because you authored the thread.Message ID: @.***>

malcolmbarrett commented 1 month ago

Thanks for confirming and for both of your input. I'll try to address this soon now that it's clearer in my mind