MoseleyBioinformaticsLab / visualizationQualityControl

Visualization methods for omics dataset quality control

PCA using the augmented correlation matrix #13

Open rmflight opened 7 years ago

rmflight commented 7 years ago

It would be really cool to be able to do a PCA decomposition on the augmented or weighted correlation matrix generated by pairwise_correlations, so that the PCA actually reflects the augmented correlation directly.

There may be a way to do this by running eigen on the correlation matrix and then generating the scores from the eigenvectors, keeping in mind that PCA on the correlation matrix corresponds to PCA on data that is already scaled and centered.

Note that I think we would have to set the diagonal to 1 for this to work properly.
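A minimal sketch of the eigen route, written in Python with NumPy rather than R and not taken from the package. The function name `pca_from_correlation` is hypothetical, and it assumes `data` is samples by features and `corr` is a features by features correlation-like matrix (which could be an augmented one). The diagonal is forced to 1, as suggested above.

```python
import numpy as np

def pca_from_correlation(data, corr):
    """Hypothetical helper: PCA scores from a supplied correlation matrix.

    data: samples x features array.
    corr: features x features correlation-like matrix.
    """
    corr = np.array(corr, dtype=float)  # copy so the caller's matrix is untouched
    np.fill_diagonal(corr, 1.0)         # force the diagonal to 1

    # eigh is for symmetric matrices and returns eigenvalues in ascending order,
    # so reorder to get the largest component first.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # PCA on a correlation matrix corresponds to centered and scaled data,
    # so standardize before projecting onto the eigenvectors.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    scores = z @ eigvecs
    return scores, eigvecs, eigvals

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 4))
corr = np.corrcoef(data, rowvar=False)  # ordinary correlation as a stand-in
scores, loadings, eigvals = pca_from_correlation(data, corr)
```

One caveat: a matrix assembled from pairwise-complete correlations is not guaranteed to be positive semidefinite, so some eigenvalues may come out negative and would need handling before being read as variances.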

Thoughts @hunter-moseley ??

rmflight commented 7 years ago

This could be tested by generating a correlation matrix for data with no missing values, and verifying that the centered / scaled PCA results match those obtained from the correlation matrix.
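A rough version of that check, sketched in Python with NumPy on simulated data (the data and variable names are made up for illustration). It compares an SVD-based PCA of the centered / scaled data with an eigendecomposition of the feature correlation matrix. Component signs are arbitrary, so scores are compared by absolute value.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated complete data: 30 samples x 5 correlated features.
data = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))

# Center and scale each feature.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Route 1: PCA via SVD of the centered / scaled data.
u, s, vt = np.linalg.svd(z, full_matrices=False)
svd_scores = u * s
svd_var = s ** 2 / (z.shape[0] - 1)

# Route 2: eigendecomposition of the feature correlation matrix.
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
eig_scores = z @ eigvecs

# Each component is only defined up to sign, so compare magnitudes.
same_variance = np.allclose(svd_var, eigvals, atol=1e-6)
same_scores = np.allclose(np.abs(svd_scores), np.abs(eig_scores), atol=1e-6)
print(same_variance, same_scores)
```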

rmflight commented 7 years ago

Possibly helpful posts:

hunter-moseley commented 7 years ago

I think of this from the standpoint of embedding from a distance matrix. The correlation can be viewed as a normalized distance matrix, and this is used to embed the rows/columns into a Euclidean space. I am starting to understand the link you sent: the covariance or correlation matrix shows the dependency between variables, which can be used to collapse the variables into principal components by calculating the eigenvectors with large eigenvalues.

hunter-moseley commented 7 years ago

Just realized that the correlation matrix needs to be between the features and not the samples. If the current PCA we are using is not dropping zeros, then this approach is going to dramatically change the PCA results, since the correlation will be limited to features that co-occur and will not be over-weighted by the zeros.
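To illustrate how much the zeros can matter, here is a simplified Python/NumPy stand-in for a feature-feature correlation that only uses samples where both features are non-zero. It is not the package's pairwise_correlations; the function name, the minimum of 3 shared samples, and the toy data are all assumptions.

```python
import numpy as np

def cooccurring_correlation(data, min_shared=3):
    """Hypothetical helper: feature x feature correlation using only
    samples where both features are non-zero.

    data: samples x features array. Pairs with fewer than `min_shared`
    shared samples, or with no variance, are left as NaN.
    """
    n_features = data.shape[1]
    corr = np.full((n_features, n_features), np.nan)
    np.fill_diagonal(corr, 1.0)
    for i in range(n_features):
        for j in range(i + 1, n_features):
            keep = (data[:, i] != 0) & (data[:, j] != 0)
            if keep.sum() < min_shared:
                continue
            xi, xj = data[keep, i], data[keep, j]
            if xi.std() == 0 or xj.std() == 0:
                continue
            corr[i, j] = corr[j, i] = np.corrcoef(xi, xj)[0, 1]
    return corr

# Toy data: feature 2 is exactly 2 x feature 1 wherever both are present.
data = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [0.0, 5.0],  # feature 1 missing (zero)
    [0.0, 0.0],  # both missing
])

dropped = cooccurring_correlation(data)
with_zeros = np.corrcoef(data, rowvar=False)
print(dropped[0, 1], with_zeros[0, 1])
```

In this toy case the co-occurring correlation is 1, while the version that keeps the zeros is noticeably lower, which is the kind of shift a PCA built on the two matrices would inherit.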

rmflight commented 7 years ago

Yes, you are right, it needs to be between features.

Right now, there is no way I know of to drop the zeros. On the log scale they are either zeros (with log1p), or the log of 1e-8 or so (when a small offset is added).
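A small Python illustration of the two log treatments, using made-up intensities. The 1e-8 offset is taken from the comment above; everything else is an assumption. Either way, the zeros remain in the data as ordinary numeric values.

```python
import math

raw = [0.0, 10.0, 1000.0]  # made-up intensities, first one "missing"

# log1p keeps a zero at exactly zero.
log1p_values = [math.log1p(v) for v in raw]

# Adding a small offset turns a zero into a large negative value.
offset_values = [math.log(v + 1e-8) for v in raw]

print(log1p_values)
print(offset_values)
```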

So the current PCA is more similar to doing correlation without dropping zeros in the sample. I'd have to look at that again to know how it compares to the augmented correlation.

On Fri, Jan 27, 2017, 10:26 PM Hunter Moseley notifications@github.com wrote:

Just realized that the correlation matrix needs to be between the features and not the samples. If the current PCA we are using is not dropping zeros, then this approach is going to dramatically change the PCA results, since the correlation will be limited to features that co-occur and will not be over-weighted by the zeros.
