RoheLab / vsp

Vintage Sparse PCA for Semi-Parametric Network Analysis
https://rohelab.github.io/vsp/dev

remove B from default? #36

Closed karlrohe closed 3 years ago

karlrohe commented 4 years ago

In PCA and traditional factor analysis, there is no "middle B matrix". Having one lets vsp identify a whole new set of structures (two-way independent factors!). This is really cool, but it can be dangerous.

Usually, loading vector j (what we call jth column of Y) has a clear relationship with principal component j (what we call jth column of Z). In vsp, this relationship need not hold. If B is the identity matrix, then everything is fine. If B has a “strong diagonal”, then everything should be fine. However, we are not currently constraining the estimation in any way to ensure that this happens! As such, B could be something like a permutation matrix, or worse! This makes the correspondence between jth column of Y and jth column of Z disappear entirely.

For example, if you fit a topic model to some text with vsp, then you look at the words that occur in topic 5. You identify them. Then, you go see the documents in topic 5… those might in fact correspond to a different topic of words!

For right now, this can be diagnosed with image(B) in R. Look for a strong diagonal. Going forward, we need a better solution… The solution is probably going to be a hack.
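A numeric companion to the visual check could look like the sketch below. The matrix `B` here is made up for illustration (it is not output from vsp), and the names `strongest` and `diag_share` are hypothetical. Rows of `B` are taken to index columns of Z and columns of `B` to index columns of Y.

```r
# Made-up 3 x 3 middle matrix: close to a permutation of the first two factors.
B <- matrix(c(0.1, 2.0, 0.0,
              1.8, 0.2, 0.1,
              0.0, 0.1, 1.5), nrow = 3, byrow = TRUE)

# For each column j of Z (row j of B), which column of Y is it most tied to?
strongest <- apply(abs(B), 1, which.max)

# Share of the total absolute mass of B that sits on the diagonal.
# Values near 1 suggest a "strong diagonal"; small values are a warning sign.
diag_share <- sum(abs(diag(B))) / sum(abs(B))

print(strongest)
print(diag_share)
```

If `strongest` is not `1, 2, ..., k`, then column j of Z and column j of Y should not be read as the same factor.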

One possibility to prevent people from hurting themselves: don't estimate both Z and Y unless users specifically ask for it. This removes the estimation of B. Under the default, we would return Z and \beta = (B Y^T). This is more like a traditional topic model.
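To make the Z and \beta idea concrete, here is a minimal sketch with simulated matrices. Nothing here calls vsp; `Z`, `B`, and `Y` are random stand-ins, and `B` is deliberately a permutation matrix to mimic the bad case described above.

```r
set.seed(1)
n <- 6; d <- 5; k <- 2

Z <- matrix(rnorm(n * k), n, k)    # n x k, stand-in for row factors
Y <- matrix(rnorm(d * k), d, k)    # d x k, stand-in for column factors
B <- matrix(c(0, 1,
              1, 0), k, k)         # permutation: the "dangerous" case

# Fold B into the column side: beta is k x d.
beta <- B %*% t(Y)

# The fitted matrix is unchanged by the reparameterization.
same_fit <- isTRUE(all.equal(Z %*% B %*% t(Y), Z %*% beta))

# Row j of beta is, by construction, the loading paired with column j of Z.
# With this B, row 1 of beta is column 2 of Y, not column 1.
row1_is_Y2 <- isTRUE(all.equal(beta[1, ], Y[, 2]))

print(same_fit)
print(row1_is_Y2)
```

Because row j of `beta` always pairs with column j of `Z`, the permutation hidden in `B` can no longer mislead a reader who compares Z with `beta`.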

Alternatively... make a function

find_rows_with_high_Y(topMany, Ycol)

which returns the topMany rows of the input matrix which have the biggest values of (ZB)_Ycol. If we inspected those rows together (e.g. if the input is a text corpus, we read those documents together), then this is "viewing the data in model space".

Similarly for find_cols_with_high_Z.
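A rough sketch of what these two helpers might look like, assuming `Z`, `B`, and `Y` are available as ordinary dense matrices. The extra `Z`, `B`, `Y` arguments and the choice to return row/column indices (rather than the rows themselves) are assumptions for illustration, as is reading "similarly" as ranking columns by column `Zcol` of Y B^T.

```r
# Indices of the topMany rows with the largest values of (Z B)[, Ycol].
find_rows_with_high_Y <- function(Z, B, topMany, Ycol) {
  scores <- (Z %*% B)[, Ycol]
  head(order(scores, decreasing = TRUE), topMany)
}

# Indices of the topMany columns with the largest values of (Y t(B))[, Zcol],
# i.e. the largest entries in row Zcol of B Y^T.
find_cols_with_high_Z <- function(B, Y, topMany, Zcol) {
  scores <- (Y %*% t(B))[, Zcol]
  head(order(scores, decreasing = TRUE), topMany)
}

# Toy inputs (made up): B swaps the two factors.
Z <- matrix(c(5, 0,
              0, 4,
              1, 1,
              0, 3), ncol = 2, byrow = TRUE)
Y <- matrix(c(2, 0,
              0, 1,
              0, 5), ncol = 2, byrow = TRUE)
B <- matrix(c(0, 1,
              1, 0), 2, 2)

top_rows <- find_rows_with_high_Y(Z, B, topMany = 2, Ycol = 1)
top_cols <- find_cols_with_high_Z(B, Y, topMany = 1, Zcol = 1)
print(top_rows)
print(top_cols)
```

With the original input matrix `A` in hand, `A[top_rows, ]` would then be the rows to read together.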

alexpghayes commented 4 years ago

Ah, I unfortunately replied to the email, not the GitHub issue. I think good options include seriation to make B as diagonal as possible, and extensive documentation of the correspondence between Z and Y. IMO removing the B matrix from the vsp object doesn't actually do anything to prevent misinterpretation: people are still free to look at Z and Y and assume the columns correspond to each other!
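As a rough illustration of the "make B as diagonal as possible" idea, the sketch below reorders the columns of B (and of Y to match) with a simple greedy matching. This is a stand-in, not a full seriation algorithm and not something vsp is known to do; `align_to_diagonal` is a hypothetical name and the matrix `B` is made up.

```r
# Greedily pair each row of B with the column holding its largest remaining
# absolute entry. Returns perm such that B[, perm] has those entries on the
# diagonal.
align_to_diagonal <- function(B) {
  k <- nrow(B)
  perm <- integer(k)
  W <- abs(B)
  for (step in seq_len(k)) {
    idx <- which(W == max(W), arr.ind = TRUE)[1, ]
    perm[idx[["row"]]] <- idx[["col"]]
    W[idx[["row"]], ] <- -Inf   # this row is now matched
    W[, idx[["col"]]] <- -Inf   # this column is now taken
  }
  perm
}

B <- matrix(c(0.1, 2.0, 0.0,
              1.8, 0.2, 0.1,
              0.0, 0.1, 1.5), nrow = 3, byrow = TRUE)

perm <- align_to_diagonal(B)
B_aligned <- B[, perm]
print(perm)
print(diag(B_aligned))
```

Applying the same `perm` to the columns of Y (`Y[, perm]`) leaves the product B Y^T unchanged, since a column permutation of B and the matching column permutation of Y cancel. A greedy match is not guaranteed to maximize the diagonal mass in general.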

karlrohe commented 4 years ago

people are still free to look at Z and Y and assume the columns correspond to each other!

By “remove B”… I actually mean return Z and beta = (B Y^T).

alexpghayes commented 3 years ago

If we do explore this, let's do it on a different branch than master.