Open jtian123 opened 1 month ago
Yes, I see what you're saying here. But do you think it's possible to separate without fitting the PCA twice? I imagine that could add a lot of computation time on large datasets. I'm not sure we can generate the variance explained without fitting the PCA, but I may be mistaken.
I'm traveling right now, but I'm happy to look at a pull request if you're interested in fleshing this out. Otherwise, I can get to it when I'm back.
To estimate variance explained, we would need to compute PCA, because the variance explained relates to the reduced feature.
Josh
From: Malcolm Barrett @.> Sent: Wednesday, August 7, 2024 7:19 AM To: USCbiostats/partition @.> Cc: Subscribed @.***> Subject: Re: [USCbiostats/partition] Can't customize part_icc with new reducer: reduce_first_component (Issue #40)
Yes, I see what you're saying here. But do you think it's possible to separate without fitting the PCA twice? I imagine that could add a lot of computation time on large datasets. I'm not sure we can generate the variance explained without fitting the PCA, but I may be mistaken.
I'm traveling right now, but I'm happy to look at a pull request if you're interested in fleshing this out. Otherwise, I can get to it when I'm back.
— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/issues/40*issuecomment-2273592506__;Iw!!LIr3w8kk_Xxm!pXn4wFi-Af1T7ZjwMMA1z8G5idVmtEoFIjsr4egosWfNG5JXUBPpZP4Y_3vyJ51x_-BNx_fqtEEkTQUirsbh2XTMD0GPJQ$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AFQJR4DVE2MCBWDQMM3TUZLZQIUGPAVCNFSM6AAAAABLTUS2B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTGU4TENJQGY__;!!LIr3w8kk_Xxm!pXn4wFi-Af1T7ZjwMMA1z8G5idVmtEoFIjsr4egosWfNG5JXUBPpZP4Y_3vyJ51x_-BNx_fqtEEkTQUirsbh2XQC9wTWeg$. You are receiving this because you are subscribed to this thread.Message ID: @.***>
Yes, this is all coming back to me now. I don't think we can avoid that detail, but I think we can address the issue.
Because the two can't be separated, when used together, measure_variance_explained()
stores the first component for (see these lines) for reduce_first_component()
to use later without needing to recompute it. So I think the principle is right and it just uses a computationally efficient approach.
Your example arguably shows a bug in reduce_first_component()
. Just as measure_variance_explained()
only stores the first component if that's the reducer, reduce_first_component()
should compute the PCA if it hasn't already been computed by measure_variance_explained()
.
So I think there are two areas of improvement here: 1) better documenting this relationship between measure_variance_explained()
and reduce_first_component()
and 2) allowing reduce_first_component()
to operate independently of measure_variance_explained()
by computing the PCA itself when needed (and only when needed)
Yes, the two important areas of improvement you summarized are exactly what I want to say in the issue report.
Thanks, Jinyao
Get Outlook for iOShttps://aka.ms/o0ukef
From: Malcolm Barrett @.> Sent: Wednesday, August 7, 2024 10:09:24 AM To: USCbiostats/partition @.> Cc: Jinyao Tian @.>; Author @.> Subject: Re: [USCbiostats/partition] Can't customize part_icc with new reducer: reduce_first_component (Issue #40)
Yes, this is all coming back to me now. I don't think we can avoid that detail, but I think we can address the issue.
Because the two can't be separated, when used together, measure_variance_explained() stores the first component for (see these lineshttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/blob/5f7366a9e73ec76ec775276b0dffe4361ed88f68/R/metrics.R*L168-L176__;Iw!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_uu8Ox-np$) for reduce_first_component() to use later without needing to recompute it. So I think the principle is right and it just uses a computationally efficient approach.
Your example arguably shows a bug in reduce_first_component(). Just as measure_variance_explained() only stores the first component if that's the reducer, reduce_first_component() should compute the PCA if it hasn't already been computed by measure_variance_explained().
So I think there are two areas of improvement here: 1) better documenting this relationship between measure_variance_explained() and reduce_first_component() and 2) allowing reduce_first_component() to operate independently of measure_variance_explained() but computing the PCA itself when needed (and only when needed)
— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/USCbiostats/partition/issues/40*issuecomment-2273934736__;Iw!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_uu1q6lmd$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AS25FF6KWXMUI752BEJWXBTZQJIEJAVCNFSM6AAAAABLTUS2B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZTHEZTINZTGY__;!!LIr3w8kk_Xxm!u5-ZO6nCkHXkYEIXWFLJimOcnIj8UvkZZIaZA3oclC9-97FVruon_FlaK6CXJQIu3XNcYTU9pNIK0KJ_utJMl013$. You are receiving this because you authored the thread.Message ID: @.***>
Thanks for confirming and for both of your input. I'll try to address this soon now that it's clearer in my mind
I am currently working with the part_icc() partitioner, which employs the direct-measure-reduce approach as follows:
Direct: direct_distance(), Minimum Distance Measure: measure_icc(), Intraclass Correlation Reduce: reduce_scaled_mean(), Scaled Row Means My intention was to retain the "direct" and "measure" stages unchanged while switching the "reduce" stage to reduce_first_component(). However, I encountered an issue where reduce_first_component() appears to be almost an "empty" function, with the primary functionality being handled within measure_variance_explained().
This is contrasted with the structure of another partitioner, part_pc1(), which also uses the direct-measure-reduce approach:
Direct: direct_distance(), Minimum Distance Measure: measure_variance_explained(), Variance Explained (PCA) Reduce: reduce_first_component(), First Principal Component It seems that in the part_pc1() partitioner, measure_variance_explained() is effectively performing the reduction task, making reduce_first_component() redundant and potentially misleading for users who intend to utilize it as a standalone reducer.
I propose separating the reduction functionality from measure_variance_explained() and enhancing reduce_first_component() to function as a true reducer. This modification would clarify the roles of these functions and facilitate easier customization for users.
Could this enhancement be considered for implementation to improve clarity and functionality?