biona001 / Knockoffs.jl

Variable Selection with Knockoffs
MIT License

Performance regression #68

Closed: biona001 closed this 8 months ago

biona001 commented 8 months ago

As reported by Jiaqi, there seems to be a significant performance regression in the group knockoff solver.

MWE (Knockoffs.jl v1.1.7 on Julia 1.10):

using Knockoffs
using LinearAlgebra
using Random
using CSV
using DataFrames

# gnomad test data
datadir = "/Users/biona001/Benjamin_Folder/research/4th_project_groupKO/group_knockoff_test_data"
mapfile = CSV.read(joinpath(datadir, "Groups_2_127374341_128034347.txt"), DataFrame)
groups = mapfile[!, :group]
covfile = CSV.read(joinpath(datadir, "CorG_2_127374341_128034347.txt"), DataFrame)
Σ = covfile |> Matrix{Float64}
Σ = 0.99Σ + 0.01I # shrink toward the identity to keep Σ positive definite

@time solve_s_group(
    Symmetric(Σ), groups, :maxent, 
    m = 1,          # number of knockoffs per variable to generate
    tol = 0.001,    # convergence tolerance
    outer_iter = 10,    # max number of coordinate descent iterations
    robust = false, # whether to use robust cholesky updates
    verbose = true    # whether to print informative intermediate results
)

┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs ~/.julia/dev/Knockoffs/src/group.jl:360
Maxent initial obj = -44822.11340914695
Iter 1 (PCA): obj = -26794.756871827496, δ = 2.6915336716594664, t1 = 24.96, t2 = 32.32
Iter 2 (CCD): obj = -23681.635922280064, δ = 0.5920258362537, t1 = 534.9, t2 = 881.71, t3 = 0.22
Iter 3 (PCA): obj = -23476.599883936684, δ = 1.0851838794654751, t1 = 559.7, t2 = 913.86
Iter 4 (CCD): obj = -23441.194586522815, δ = 0.01917853738286209, t1 = 1059.27, t2 = 1761.83, t3 = 0.44
Iter 5 (PCA): obj = -23428.13516213957, δ = 0.20912865614936338, t1 = 1085.4, t2 = 1794.69
... # terminated

On Knockoffs.jl v0.3.0 (Julia 1.6.7), see the Fast Cholesky update notebook:

solve_group_max_entropy_ccd: Optimizing 58997 variables
┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs /Users/biona001/.julia/dev/Knockoffs/src/group.jl:230
initial obj = -13811.93738418022
Iter 1: obj = -8863.340455999352, δ = 0.9033274319764998, t1 = 8.85, t2 = 18.23, t3 = 0.04
Iter 2: obj = -7509.4699925543855, δ = 0.7627302489344706, t1 = 19.04, t2 = 36.57, t3 = 0.08
Iter 3: obj = -7495.512979153219, δ = 0.153471691368552, t1 = 29.25, t2 = 54.77, t3 = 0.12
Iter 4: obj = -7491.027981075002, δ = 0.0467433749392695, t1 = 39.2, t2 = 72.95, t3 = 0.16
Iter 5: obj = -7489.0243504490645, δ = 0.007200479138871564, t1 = 49.08, t2 = 90.98, t3 = 0.2
Iter 6: obj = -7487.899557778442, δ = 0.004873425110742543, t1 = 58.98, t2 = 109.34, t3 = 0.24
Iter 7: obj = -7487.198762711129, δ = 0.0031886466238885947, t1 = 68.75, t2 = 127.98, t3 = 0.28
Iter 8: obj = -7486.7318922165505, δ = 0.002106998635746351, t1 = 78.17, t2 = 146.58, t3 = 0.32
Iter 9: obj = -7486.405698134362, δ = 0.0011212737859179697, t1 = 87.39, t2 = 165.21, t3 = 0.37
Iter 10: obj = -7486.1699950030525, δ = 0.0006708441935208511, t1 = 96.52, t2 = 183.44, t3 = 0.41
283.395020 seconds (3.23 M allocations: 329.596 MiB, 0.05% gc time, 0.61% compilation time)

The different initial objective values are due to different initializations of S. However, the timing seems much slower, even accounting for the runs being on different computers.

biona001 commented 8 months ago

After a bit more digging, I am more inclined to think there is no performance regression, because the original problem was likely run on a subset of the variables.

p = 1241 # this includes group 263, which is the largest group with 192 members
groups = groups[1:p]
Σ = Σ[1:p, 1:p]

# Knockoffs.jl version 1.1.7
@time solve_s_group(
    Symmetric(Σ), groups, :maxent, 
    m = 1,          # number of knockoffs per variable to generate
    tol = 0.00001,    # convergence tolerance
    outer_iter = 10,   # max number of coordinate descent iterations
    inner_ccd_iter = 1,
    inner_pca_iter = 0,
    robust = false,   # whether to use robust cholesky updates
    verbose = true    # whether to print informative intermediate results
);

┌ Warning: Maximum group size is 192, optimization may be slow. Consider running `modelX_gaussian_rep_group_knockoffs` to speed up convergence.
└ @ Knockoffs /home/groups/sabatti/.julia/packages/Knockoffs/Pvo6U/src/group.jl:360
Maxent initial obj = -13827.762715952922
Iter 1 (CCD): obj = -8578.305552411875, δ = 0.9025475541472799, t1 = 11.23, t2 = 23.98, t3 = 0.06
Iter 2 (CCD): obj = -7555.017497751844, δ = 0.7478481743005219, t1 = 23.57, t2 = 47.4, t3 = 0.13
Iter 3 (CCD): obj = -7550.483748831955, δ = 0.018305715936092613, t1 = 35.33, t2 = 70.84, t3 = 0.21
Iter 4 (CCD): obj = -7548.862400282033, δ = 0.016699603678850573, t1 = 46.44, t2 = 94.33, t3 = 0.28
Iter 5 (CCD): obj = -7548.3273548868, δ = 0.008661606230114097, t1 = 56.82, t2 = 117.76, t3 = 0.35
Iter 6 (CCD): obj = -7548.015458746686, δ = 0.00498206737500809, t1 = 66.43, t2 = 141.06, t3 = 0.42
Iter 7 (CCD): obj = -7547.793581622985, δ = 0.0026061168087445695, t1 = 75.48, t2 = 164.59, t3 = 0.49
Iter 8 (CCD): obj = -7547.619776039311, δ = 0.0014028998110954276, t1 = 83.86, t2 = 188.1, t3 = 0.56
Iter 9 (CCD): obj = -7547.4770852558795, δ = 0.0007153736353831524, t1 = 91.84, t2 = 211.48, t3 = 0.64
Iter 10 (CCD): obj = -7547.358121710283, δ = 0.0002760958077924706, t1 = 99.6, t2 = 234.89, t3 = 0.71
336.402501 seconds (595.27 k allocations: 357.114 MiB, 0.01% gc time)

With p = 1241, the overall time and objective value are much closer to the values attained 2 years ago.

biona001 commented 8 months ago

A few more hours later, I'm starting to think PCA updates are slower than they should be. Here is a test on a 9000-by-9000 input matrix:

using DelimitedFiles # provides readdlm

file = "/oak/stanford/groups/zihuai/HighD_Sigma/UCB_cor.csv"
Σ = readdlm(file, ',', skipstart=1)
groups = hc_partition_groups(Symmetric(Σ), cutoff=0.5, linkage=:average)
group_reps = choose_group_reps(Symmetric(Σ), groups, threshold=0.5);

On Julia v1.10 & Knockoffs.jl version 1.1.7:

solve_s_graphical_group(
    Symmetric(Σ), groups, group_reps, :maxent; 
    m = 5, verbose = true, inner_pca_iter=1,
    inner_ccd_iter=1, outer_iter = 2
)

8999 representatives for 9132 variables, 9033 optimization variables
PCA optimization: 8999 variables
CCD optimization: 8999 diagonal variables
CCD optimization: 17 off-diagonal variables
Maxent initial obj = -130896.16543115964
Iter 1 (PCA): obj = -124299.86980415294, δ = 0.5286405884857434, t1 = 1279.42, t2 = 278.81
Iter 2 (CCD): obj = -123656.14555594444, δ = 0.3698519162684533, t1 = 43.83, t2 = 274.62, t3 = 0.0
Iter 3 (PCA): obj = -118778.12068774736, δ = 0.42098723933166904, t1 = 2481.3, t2 = 524.16
Iter 4 (CCD): obj = -118445.11098235719, δ = 0.2330629376150859, t1 = 86.36, t2 = 513.34, t3 = 0.0
PCA timings: t1_pca = 2481.29556, t2_pca = 524.15894, t3_pca = 0.0
CCD timings: t1_ccd = 86.36043, t2_ccd = 513.33717, t3_ccd = 0.00045

It's clear that t1 (the time spent updating Cholesky factors) is much slower for PCA updates than for CCD updates, which doesn't make sense to me, since CCD requires more Cholesky updates, not fewer. Thus I added some temporary code to count how many "early Cholesky terminations" happen for PCA vs CCD:

solve_s_graphical_group(
    Symmetric(Σ), groups, group_reps, :maxent; 
    m = 5, verbose = true, inner_pca_iter=1,
    inner_ccd_iter=1, outer_iter = 10
)

8999 representatives for 9132 variables, 9033 optimization variables
PCA optimization: 8999 variables
CCD optimization: 8999 diagonal variables
CCD optimization: 17 off-diagonal variables
Maxent initial obj = -130896.16543115964
Iter 1 (PCA): obj = -124299.86980415294, δ = 0.5286405884857434, t1 = 1279.98, t2 = 287.63
pca_early_terminated = 9019
Iter 2 (CCD): obj = -123656.14555594444, δ = 0.3698519162684533, t1 = 42.88, t2 = 273.52, t3 = 0.0
ccd_early_terminated = 207
Iter 3 (PCA): obj = -118778.12068774736, δ = 0.42098723933166904, t1 = 2526.84, t2 = 557.25
pca_early_terminated = 9019
Iter 4 (CCD): obj = -118445.11098235719, δ = 0.2330629376150859, t1 = 85.95, t2 = 531.38, t3 = 0.0
ccd_early_terminated = 231
Iter 5 (PCA): obj = -114810.46063950789, δ = 0.3855781593833003, t1 = 3816.9, t2 = 850.46
pca_early_terminated = 9019
Iter 6 (CCD): obj = -114638.32618042963, δ = 0.09078804983710237, t1 = 118.88, t2 = 821.56, t3 = 0.0
ccd_early_terminated = 192
Iter 7 (PCA): obj = -112127.17150059695, δ = 0.3573975408848783, t1 = 5112.73, t2 = 1145.55
pca_early_terminated = 9019
Iter 8 (CCD): obj = -112033.49440467537, δ = 0.06394698256688378, t1 = 141.36, t2 = 1113.82, t3 = 0.0
ccd_early_terminated = 155
Iter 9 (PCA): obj = -110229.42080921704, δ = 0.2913788676608088, t1 = 6407.41, t2 = 1440.61
pca_early_terminated = 9019
Iter 10 (CCD): obj = -110175.8992055564, δ = 0.04113289361230646, t1 = 155.91, t2 = 1404.21, t3 = 0.0
ccd_early_terminated = 120
Iter 11 (PCA): obj = -108676.19894636273, δ = 0.19573378930103782, t1 = 7700.19, t2 = 1734.3
pca_early_terminated = 9019
Iter 12 (CCD): obj = -108657.18365041119, δ = 0.01959907766848471, t1 = 165.56, t2 = 1693.48, t3 = 0.0
ccd_early_terminated = 80
Iter 13 (PCA): obj = -107312.02842853116, δ = 0.18208840012112923, t1 = 8989.03, t2 = 2027.97
pca_early_terminated = 9019
Iter 14 (CCD): obj = -107302.33547962835, δ = 0.025094298874137874, t1 = 173.6, t2 = 1982.7, t3 = 0.0
ccd_early_terminated = 54
Iter 15 (PCA): obj = -106186.3408574712, δ = 0.22998709100773107, t1 = 10282.46, t2 = 2322.45
pca_early_terminated = 9019
Iter 16 (CCD): obj = -106182.00466931309, δ = 0.013124608119177414, t1 = 181.32, t2 = 2272.57, t3 = 0.0
ccd_early_terminated = 44
Iter 17 (PCA): obj = -105226.74312417053, δ = 0.18798423882462398, t1 = 11585.63, t2 = 2618.13
pca_early_terminated = 9019
Iter 18 (CCD): obj = -105225.00889733771, δ = 0.016869569663403902, t1 = 188.68, t2 = 2565.28, t3 = 0.0
ccd_early_terminated = 36
Iter 19 (PCA): obj = -104351.73259806576, δ = 0.19649363409471488, t1 = 12893.27, t2 = 2915.27
pca_early_terminated = 9019
Iter 20 (CCD): obj = -104351.0770831528, δ = 0.005042151160751619, t1 = 195.93, t2 = 2855.19, t3 = 0.0
ccd_early_terminated = 34
PCA timings: t1_pca = 12893.26639, t2_pca = 2915.26801, t3_pca = 0.0
CCD timings: t1_ccd = 195.93278, t2_ccd = 2855.18508, t3_ccd = 0.0021

This is weird: t1_pca is >100x larger than t1_ccd, even though PCA early-terminates far more often.
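For reference, here is why the t1 numbers matter so much: updating an existing Cholesky factor by a rank-1 term is O(n^2), while re-factorizing from scratch is O(n^3). A standalone sketch of this asymmetry (not Knockoffs.jl internals), using the `lowrankupdate!` routine from the LinearAlgebra stdlib:

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 2000
A = randn(n, n)
Σ = Symmetric(A'A / n + 5I)   # well-conditioned SPD matrix
C = cholesky(Σ)               # full factorization, O(n^3)
v = randn(n) / sqrt(n)

# rank-1 update of the existing factor, O(n^2); lowrankupdate! destroys its
# second argument, so pass a copy
t_update = @elapsed lowrankupdate!(C, copy(v))
# re-factorizing Σ + v*v' from scratch, O(n^3)
t_full = @elapsed cholesky(Symmetric(Matrix(Σ) + v * v'))
println("rank-1 update: $(t_update)s, full re-factorization: $(t_full)s")
```

After the update, `C` holds the factor of `Σ + v*v'` without ever re-factorizing, which is exactly the kind of work t1 is measuring per coordinate update.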

biona001 commented 8 months ago

It turns out t1_ccd is much faster than t1_pca because CCD updates were very often skipped (they did not improve the objective). The root cause was a small bug in the update to the diagonal entries in the CCD step. After fixing it, CCD updates take as long as PCA updates, but the objective improves much faster.
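The skip logic can be illustrated with a generic coordinate-descent sketch (hypothetical; `f_obj` and `coord_descent` are illustrative names, not the package's actual update rule). Each coordinate proposes a step, and the expensive factor update is paid for only when the step improves the objective; a bug that makes the proposed step non-improving causes most updates to be skipped, making t1 look deceptively cheap:

```julia
using LinearAlgebra, Random

# quadratic objective f(x) = x'Qx/2 - b'x, minimized coordinate-by-coordinate
f_obj(Q, b, x) = 0.5 * dot(x, Q, x) - dot(b, x)

function coord_descent(Q, b; sweeps = 20)
    n = length(b)
    x = zeros(n)
    skipped = 0
    for _ in 1:sweeps, i in 1:n
        # exact minimizer of f along coordinate i
        δ = (b[i] - dot(view(Q, :, i), x)) / Q[i, i]
        xnew = copy(x); xnew[i] += δ
        if f_obj(Q, b, xnew) < f_obj(Q, b, x)
            x = xnew      # commit: the real solver pays for a Cholesky update here
        else
            skipped += 1  # skip: no factor update is performed
        end
    end
    return x, skipped
end

Random.seed!(0)
n = 50
A = randn(n, n)
Q = Symmetric(A'A + I)
b = randn(n)
x, skipped = coord_descent(Q, b)
println("f(x) = $(f_obj(Q, b, x)), skipped updates = $skipped")
```

With a correct proposal, nearly every step improves the objective and skips are rare; a proposal that is wrong (as in the diagonal-entry bug) gets rejected almost every time, so the per-iteration cost collapses while the objective stalls.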

On Julia v1.10 with the soon-to-be-released Knockoffs.jl v1.1.8:

solve_s_graphical_group(
    Symmetric(Σ), groups, group_reps, :maxent; 
    m = 5, verbose = true, inner_pca_iter=1,
    inner_ccd_iter=1, outer_iter = 10
)

8999 representatives for 9132 variables, 9033 optimization variables
PCA optimization: 8999 variables
CCD optimization: 8999 diagonal variables
CCD optimization: 17 off-diagonal variables
Maxent initial obj = -130896.16543115964
Iter 1 (PCA): obj = -124299.86980415294, δ = 0.5286405884857434, t1 = 909.06, t2 = 264.1
pca_early_terminated = 9019
Iter 2 (CCD): obj = -118639.79956687437, δ = 0.495211713131239, t1 = 891.52, t2 = 260.2, t3 = 0.0
ccd_early_terminated = 9022
Iter 3 (PCA): obj = -114686.69186315064, δ = 0.45275292664408795, t1 = 1833.05, t2 = 546.13
pca_early_terminated = 9019
Iter 4 (CCD): obj = -112096.68739662363, δ = 0.32684850310884456, t1 = 1840.3, t2 = 546.12, t3 = 0.0