gabrielodom / pathwayPCA

integrative pathway analysis with modern PCA methodology and gene selection
https://gabrielodom.github.io/pathwayPCA/
11 stars 2 forks source link

aespca warning #65

Closed gabrielodom closed 2 years ago

gabrielodom commented 5 years ago

When I call aespca() on the full colonSurv_df data frame, I get this warning a million times:

In (Cmax + C)/(A + a) :
  longer object length is not a multiple of shorter object length

It doesn't show up for all subsets of this data, however (columns 5:50 for example).

gabrielodom commented 2 years ago

I can get 12 warnings if I use

library(pathwayPCA)
data("colonSurv_df")
aespca(as.matrix(colonSurv_df[, 4:264]))

If I include columns 4:263, I get no warnings; if I include columns 4:265 I get 44 warnings. Including columns 10:264 gives me no warnings; columns 5:265 yields 22; 5:264 yields 30; 104:363 yields 8; 304:563 yields 4. I think the warnings are due to having more features than samples. This won't come up for most pathways, but we still need to track down why this bug is happening and fix it.

gabrielodom commented 2 years ago

I tried this 5 times with no warnings:

randCols_idx <- sample(4:ncol(colonSurv_df), size = 250, replace = FALSE)
aespca(as.matrix(colonSurv_df[, randCols_idx]))

I increased the size to 251 and repeated 5x; I got 0, 0, 0, 14, 0 warnings. Here are other attempts: p = 252; 0, 0, 0, 0, 0 p = 255; 0, 0, 0, 0, 0 p = 260; 50+, 3, 20, 2, 2

Let me try to set a seed, now that I can consistently trigger these warnings.

set.seed(8675309)
randCols_idx <- sample(4:ncol(colonSurv_df), size = 260, replace = FALSE)
aespca(as.matrix(colonSurv_df[, randCols_idx]))

I was able to track this to the lars.lsa() function. I'm digging there next

gabrielodom commented 2 years ago

Inside lars.lsa(), the two vectos are C and a. C is defined at the top of the while() loop, while a is defined at the bottom. I'm still trying to track down why in some cases they end up with different lengths (but only when $p > N$).

gabrielodom commented 2 years ago

I used a browser to walk through a few dozen iterations past step = 270 (see the example code in scripts/issue_65_20220222.R). I then added another hatch to handle cases when a and C are different lengths (borrowing the same hatch for when the number of "active" features is larger than the dimension of Sigma):

dropCols_idx <- c(active, unique(ignores))
if ( length(C) != (m - length(dropCols_idx)) ) {

  gamhat <- Cmax / A

}