DataSlingers / clustRviz

Compute Convex (Bi)Clustering Solutions via Algorithmic Regularization
https://DataSlingers.github.io/clustRviz/
GNU General Public License v3.0
19 stars 14 forks source link

Biclustering #112

Closed dansenglund closed 4 years ago

michaelweylandt commented 4 years ago

I'm going to go ahead and merge this, but there are definitely some issues that need addressing in future PRs. I'll open some issues, linked to this PR, to track them.

In brief:

1) The following script, run on current master and on this PR, shows that the current algorithm seems to perform better wall-clock wise and path-approximation-wise. The differences are minor, and can probably be addressed by additional parameter tuning and optimization, so I think it's worth merging to get the additional future features enabled by this PR.

clustRviz_options(keep = 1, keep_debug_info = TRUE)
X <- presidential_speech[1:15, 1:15]
cbass_fit <- CBASS(X, exact = FALSE, back_track = FALSE, X.center.global = FALSE)
exact_fit <- CBASS(X, exact = TRUE,  back_track = FALSE, X.center.global = FALSE)

D_row <- cbass_fit$row_fusions$D
D_col <- t(cbass_fit$col_fusions$D)

w_row <- cbass_fit$row_fusions$weights; w_row <- clustRviz:::weight_mat_to_vec(w_row); w_row <- w_row[w_row != 0]
w_col <- cbass_fit$col_fusions$weights; w_col <- clustRviz:::weight_mat_to_vec(w_col); w_col <- w_col[w_col != 0]

w_row <- w_row / sum(w_row); w_row <- w_row / sqrt(NROW(X))
w_col <- w_col / sum(w_col); w_col <- w_col / sqrt(NCOL(X))

cbass_gamma <- cbass_fit$debug$path$gamma_path
cbass_U     <- array(cbass_fit$debug$path$u_path, c(NROW(X), NCOL(X), length(cbass_gamma)))

exact_gamma <- exact_fit$debug$path$gamma_path
exact_U     <- array(exact_fit$debug$path$u_path, c(NROW(X), NCOL(X), length(exact_gamma)))

obj_func <- function(U, lambda, l1 = FALSE){
    DU <- D_row %*% U; DTU <- D_col %*% t(U)
    if(l1){
        sum((X - U)^2)/2 + lambda * sum(w_row * abs(DU)) + lambda * sum(w_col * abs(t(DTU)))
    } else {
        sum((X - U)^2)/2 + lambda * sum(w_row * sqrt(rowSums(DU^2))) + lambda * sum(w_col * sqrt(rowSums(DTU^2)))
    }
}

obj_cbass <- numeric(length(cbass_gamma))
obj_exact <- numeric(length(exact_gamma))

for(ix in seq_along(cbass_gamma)){
    gamma <- cbass_gamma[ix]
    U_hat <- cbass_U[,,ix]
    obj_cbass[ix] <- obj_func(U_hat, gamma)
}

for(ix in seq_along(exact_gamma)){
    gamma <- exact_gamma[ix]
    U_hat <- exact_U[,,ix]
    obj_exact[ix] <- obj_func(U_hat, gamma)
}

# FIXME These seem to line up - why aren't they exactly aligned?
range(cbass_gamma[90:2252] / exact_gamma[92:2254])

plot(cbass_gamma[90:2252], obj_cbass[90:2252] / obj_exact[92:2254])

2) We need some more user-facing flexibility on the exact = TRUE ADMM. I'd suggest that the number of per gamma iterations (currently 2501) and the convergence thresholds be made consistent across clustering and bi-clustering and configurable through clustRviz_options.

3) The algorithm string printed by print.CBASS is no longer accurate.

4) The calls to reset_aux should be removed. Right now, they only have the effect of slowing down convergence with exact = TRUE by resetting U for no real reason.