FertigLab / CoGAPS

Bayesian MCMC matrix factorization algorithm
https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html
BSD 3-Clause "New" or "Revised" License

SparseOptimization pattern discrepancy #77

Open rpalaganas opened 8 months ago

rpalaganas commented 8 months ago

Good afternoon! I recently ran into a pattern discrepancy between runs with sparseOptimization set to TRUE versus FALSE. The code I ran and its output are below. With sparseOptimization set to TRUE, the ChiSq value was -nan, and during the equilibration phase the P matrix had 0 atoms. With sparseOptimization set to FALSE there seemed to be no problems; however, the number of patterns learned differed between the two cases: sparseOptimization = TRUE gave 5 patterns while sparseOptimization = FALSE gave 6 patterns. This was true across the range of nPatterns values I ran (5-50).

SPARSE OPTIMIZATION ENABLED

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, 
sparseOptimization=TRUE,
distributed="genome-wide")

params <- setDistributedParams(params, nSets=6)

Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            5 
minNS          3 
maxNS          9 

Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 4 is starting!
    worker 6 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 13376(A), 1242(P), ChiSq: -nan, Time: 00:00:45 / 01:16:13
...
30000 of 30000, Atoms: 20636(A), 1461(P), ChiSq: -nan, Time: 00:35:40 / 01:16:38
-- Sampling Phase --
1000 of 30000, Atoms: 20671(A), 1460(P), ChiSq: -nan, Time: 00:36:54 / 01:16:28
...
29000 of 30000, Atoms: 20645(A), 1469(P), ChiSq: -nan, Time: 01:12:07 / 01:13:27
    worker 2 is finished! Time: 01:12:22
30000 of 30000, Atoms: 20670(A), 1484(P), ChiSq: -nan, Time: 01:13:21 / 01:13:21
    worker 1 is finished! Time: 01:13:21
    worker 3 is finished! Time: 01:13:24
    worker 5 is finished! Time: 01:15:26
    worker 4 is finished! Time: 01:15:26
    worker 6 is finished! Time: 01:19:08

Matching Patterns Across Subsets...
Running Final Stage...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 6 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
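-- Equilibration Phase --
1000 of 30000, Atoms: 10022(A), 0(P), ChiSq: -nan, Time: 00:00:27 / 00:45:43
...
30000 of 30000, Atoms: 15126(A), 0(P), ChiSq: -nan, Time: 00:22:44 / 00:48:51
-- Sampling Phase --
1000 of 30000, Atoms: 15203(A), 0(P), ChiSq: -nan, Time: 00:23:29 / 00:48:39
...
27000 of 30000, Atoms: 15284(A), 0(P), ChiSq: -nan, Time: 00:44:45 / 00:47:20
    worker 3 is finished! Time: 00:45:34
28000 of 30000, Atoms: 15238(A), 0(P), ChiSq: -nan, Time: 00:45:35 / 00:47:18
    worker 4 is finished! Time: 00:46:23
29000 of 30000, Atoms: 15219(A), 0(P), ChiSq: -nan, Time: 00:46:24 / 00:47:16
    worker 6 is finished! Time: 00:47:10
30000 of 30000, Atoms: 15174(A), 0(P), ChiSq: -nan, Time: 00:47:13 / 00:47:13
    worker 1 is finished! Time: 00:47:13
    worker 2 is finished! Time: 00:47:28
    worker 5 is finished! Time: 00:47:34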
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

SPARSE OPTIMIZATION DISABLED

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42,
distributed="genome-wide")

params <- setDistributedParams(params, nSets=6)

Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   FALSE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            5 
minNS          3 
maxNS          9 

Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...

    worker 2 is starting!
    worker 3 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 4 is starting!
    worker 5 is starting!
    worker 6 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 4665(A), 966(P), ChiSq: 5137063, Time: 00:01:16 / 02:08:43
...
30000 of 30000, Atoms: 9933(A), 2460(P), ChiSq: 4886798, Time: 00:49:52 / 01:47:09
-- Sampling Phase --
1000 of 30000, Atoms: 10033(A), 2514(P), ChiSq: 4886740, Time: 00:51:31 / 01:46:45
...
30000 of 30000, Atoms: 9953(A), 2489(P), ChiSq: 4886819, Time: 01:34:05 / 01:34:05
    worker 1 is finished! Time: 01:34:05
    worker 5 is finished! Time: 01:44:52
    worker 4 is finished! Time: 01:54:06
    worker 2 is finished! Time: 01:54:29
    worker 6 is finished! Time: 01:54:31
    worker 3 is finished! Time: 01:54:38

Matching Patterns Across Subsets...
Running Final Stage...

    worker 5 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 2 is starting!
    worker 6 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 5928(A), 0(P), ChiSq: 14908930, Time: 00:00:10 / 00:16:56
...
30000 of 30000, Atoms: 10469(A), 0(P), ChiSq: 14908930, Time: 00:08:47 / 00:18:52
-- Sampling Phase --
1000 of 30000, Atoms: 10403(A), 0(P), ChiSq: 14908930, Time: 00:09:00 / 00:18:39
...
30000 of 30000, Atoms: 10379(A), 0(P), ChiSq: 14908930, Time: 00:15:17 / 00:15:17
    worker 1 is finished! Time: 00:15:17
    worker 5 is finished! Time: 00:16:29
    worker 3 is finished! Time: 00:19:47
    worker 2 is finished! Time: 00:20:37
    worker 4 is finished! Time: 00:20:38
    worker 6 is finished! Time: 00:20:45
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

After obtaining the patterns, I ran patternMarkers on patterns learned with sparseOptimization = TRUE. When I set threshold = “all”, I would get this error.


test <- patternMarkers_all(Hoxd10_matnp5, threshold = "all")

Error in colnames(markerScores)[apply(markerScores, 1, which.min)] : 
  invalid subscript type 'list'
This error did not occur when threshold was set to “cut”, and patternMarkers worked normally on patterns learned without sparseOptimization.
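
A possible explanation for the "invalid subscript type 'list'" message, offered as an assumption and not confirmed anywhere in this thread: if some rows of markerScores are entirely NaN (which would be consistent with the -nan ChiSq above), which.min returns a zero-length result for those rows, apply can no longer simplify its output to a vector, and it returns a list instead. The base-R sketch below reproduces that mechanism on a made-up score matrix; the values and gene names are hypothetical and nothing here calls CoGAPS itself.

```r
# Toy score matrix with hypothetical values; gene2 is NaN in every pattern.
markerScores <- matrix(
  c(0.2, NaN, 0.7,   # Pattern_1 scores for gene1..gene3
    0.9, NaN, 0.1),  # Pattern_2 scores for gene1..gene3
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("Pattern_1", "Pattern_2"))
)

# which.min() ignores NaN, so an all-NaN row yields integer(0).
# Results of unequal length force apply() to return a list.
idx <- apply(markerScores, 1, which.min)
print(is.list(idx))

# Indexing colnames() with that list raises the error from the report.
msg <- tryCatch(
  colnames(markerScores)[idx],
  error = function(e) conditionMessage(e)
)
print(msg)

# Rows that would trigger the problem can be located up front.
badRows <- rownames(markerScores)[rowSums(is.finite(markerScores)) == 0]
print(badRows)
```

If badRows is non-empty on the real score matrix, that would point to the NaN values from the sparse run as the source of the failure, not to the threshold argument.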

UPDATE (@dimalvovs): deleted log rows for readability.

ejfertig commented 8 months ago

Are you filtering genes with zero expression? It’s notable that ChiSq is negative.
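
A quick way to answer this kind of question before rerunning is to summarise the input matrix. The sketch below is base R only; summarizeInput is a hypothetical helper (not part of CoGAPS), and the toy matrix stands in for the real genes-by-samples data. Whether any of these properties actually causes the -nan ChiSq is not established in this thread.

```r
# Hypothetical helper (not a CoGAPS function): count properties of a
# genes-by-samples matrix that are worth ruling out when ChiSq is NaN.
summarizeInput <- function(mat) {
  mat <- as.matrix(mat)  # note: densifies the input; fine for a quick check
  c(
    nGenes          = nrow(mat),
    # genes with no non-zero entry (missing values are ignored)
    zeroExprGenes   = sum(rowSums(mat != 0, na.rm = TRUE) == 0),
    # genes whose values are all identical; this includes all-zero genes
    zeroVarGenes    = sum(apply(mat, 1, function(x) isTRUE(all(x == x[1])))),
    nonFiniteValues = sum(!is.finite(mat)),
    negativeValues  = sum(mat < 0, na.rm = TRUE)
  )
}

# Toy data: one all-zero gene, one constant gene, one missing value.
toy <- matrix(c(0, 0,  0,
                1, 1,  1,
                2, 0,  5,
                3, NA, 4),
              nrow = 4, byrow = TRUE)
s <- summarizeInput(toy)
print(s)
```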

rpalaganas commented 8 months ago

I did not filter genes with zero expression

ejfertig commented 8 months ago

Can you let us know what happens and the difference if you do that?

rpalaganas commented 7 months ago

Good morning, Sorry for the delay. After removing zero variance and zero expression genes with

#drop zero expression genes 
row_sums <- rowSums(Hoxd10.mat)
zeroindx <- which(row_sums == 0)
Hoxd10mat_filtered <- Hoxd10.mat[-zeroindx,]

#drop zero variance genes 
row_var <- rowVars(Hoxd10mat_filtered)
zervarindx <- which(row_var == 0)
Hoxd10mat_filtered <- Hoxd10mat_filtered[-zervarindx,]
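
A side note on the filtering above: `x[-which(cond), ]` drops every row when `which()` finds nothing, and `rowVars` is not base R (it comes from packages such as matrixStats). A minimal base-R sketch of the same filter with a logical mask, on a made-up 4-gene matrix `m`, could look like this:

```r
# Made-up matrix: gene2 is all zero, gene3 has zero variance.
m <- matrix(c(1, 2, 3,
              0, 0, 0,
              4, 4, 4,
              5, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4), NULL))

# Pitfall: an empty which() result turns negative indexing into "select nothing".
no_match <- which(rowSums(m) < 0)
rows_left <- nrow(m[-no_match, , drop = FALSE])
print(rows_left)

# Logical mask: keep rows that are not all zero and have non-zero variance.
keep <- rowSums(m) != 0 & apply(m, 1, var) > 0
m_filtered <- m[keep, , drop = FALSE]
print(rownames(m_filtered))
```

The logical mask gives the same result as the index-based version when matches exist, and leaves the matrix intact when there are none.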

I reran CoGAPS with and without sparse optimization. With sparse optimization the result was similar: ChiSq was again -nan.

> Hoxd10_mat <- readRDS('~Hoxd10mat_filtered.RDS')

> params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE, distributed="genome-wide")
> params <- setDistributedParams(params, nSets=5)
setting distributed parameters - call this again if you change nPatterns

> Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 16620(A), 1167(P), ChiSq: -nan, Time: 00:00:52 / 01:28:04
...
30000 of 30000, Atoms: 25197(A), 1328(P), ChiSq: -nan, Time: 00:41:15 / 01:28:38
-- Sampling Phase --
1000 of 30000, Atoms: 25152(A), 1329(P), ChiSq: -nan, Time: 00:42:42 / 01:28:29
...
30000 of 30000, Atoms: 25114(A), 1344(P), ChiSq: -nan, Time: 01:24:45 / 01:24:45
    worker 1 is finished! Time: 01:24:45
    worker 2 is finished! Time: 01:24:49
    worker 3 is finished! Time: 01:25:26
    worker 5 is finished! Time: 01:28:35
    worker 4 is finished! Time: 01:30:20

Matching Patterns Across Subsets...
Running Final Stage...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 5 is starting!
    worker 3 is starting!
    worker 4 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 12640(A), 0(P), ChiSq: -nan, Time: 00:00:34 / 00:57:35
...
30000 of 30000, Atoms: 19022(A), 0(P), ChiSq: -nan, Time: 00:25:15 / 00:54:15
-- Sampling Phase --
1000 of 30000, Atoms: 19126(A), 0(P), ChiSq: -nan, Time: 00:26:07 / 00:54:07
...
30000 of 30000, Atoms: 18941(A), 0(P), ChiSq: -nan, Time: 00:51:43 / 00:51:43
    worker 1 is finished! Time: 00:51:43
    worker 3 is finished! Time: 00:52:06
    worker 5 is finished! Time: 00:52:45
    worker 2 is finished! Time: 00:52:56
    worker 4 is finished! Time: 00:53:10
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

Running without sparse optimization looked normal:

> Hoxd10_mat <- readRDS('~Hoxd10mat_filtered.RDS')
> params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, distributed="genome-wide")
> params <- setDistributedParams(params, nSets=5)
setting distributed parameters - call this again if you change nPatterns

> Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   FALSE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

    worker 2 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 5351(A), 1181(P), ChiSq: 6223970, Time: 00:01:14 / 02:05:20
...
30000 of 30000, Atoms: 12093(A), 2528(P), ChiSq: 5907389, Time: 00:55:27 / 01:59:09
-- Sampling Phase --
1000 of 30000, Atoms: 12133(A), 2524(P), ChiSq: 5907130, Time: 00:57:20 / 01:58:48
...
22000 of 30000, Atoms: 12115(A), 2522(P), ChiSq: 5906996, Time: 01:38:09 / 01:54:53
    worker 5 is finished! Time: 01:39:32
23000 of 30000, Atoms: 12027(A), 2495(P), ChiSq: 5907204, Time: 01:40:02 / 01:54:40
24000 of 30000, Atoms: 12127(A), 2530(P), ChiSq: 5906893, Time: 01:41:46 / 01:54:16
25000 of 30000, Atoms: 12067(A), 2549(P), ChiSq: 5907026, Time: 01:43:30 / 01:53:54
26000 of 30000, Atoms: 12093(A), 2520(P), ChiSq: 5907063, Time: 01:45:13 / 01:53:31
    worker 4 is finished! Time: 01:46:33
27000 of 30000, Atoms: 12146(A), 2533(P), ChiSq: 5907351, Time: 01:46:57 / 01:53:09
    worker 3 is finished! Time: 01:48:18
28000 of 30000, Atoms: 12094(A), 2504(P), ChiSq: 5906844, Time: 01:48:37 / 01:52:44
    worker 2 is finished! Time: 01:49:34
29000 of 30000, Atoms: 12145(A), 2491(P), ChiSq: 5907016, Time: 01:50:00 / 01:52:03
30000 of 30000, Atoms: 12155(A), 2493(P), ChiSq: 5907008, Time: 01:51:34 / 01:51:34
    worker 1 is finished! Time: 01:51:34

Matching Patterns Across Subsets...
Running Final Stage...

    worker 5 is starting!
    worker 3 is starting!
    worker 4 is starting!
    worker 2 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 6726(A), 0(P), ChiSq: 18248056, Time: 00:00:08 / 00:13:33
...
30000 of 30000, Atoms: 11486(A), 0(P), ChiSq: 18248056, Time: 00:07:05 / 00:15:13
-- Sampling Phase --
1000 of 30000, Atoms: 11367(A), 0(P), ChiSq: 18248056, Time: 00:07:20 / 00:15:11
...
30000 of 30000, Atoms: 11467(A), 0(P), ChiSq: 18248056, Time: 00:14:32 / 00:14:32
    worker 1 is finished! Time: 00:14:32
    worker 2 is finished! Time: 00:15:12
    worker 5 is finished! Time: 00:16:59
    worker 3 is finished! Time: 00:17:06
    worker 4 is finished! Time: 00:17:30
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

This time, each run generated the same number of patterns; however, the values differed:

> range(sparseTRUE@featureLoadings)
[1] 0.000000 9.607491
> range(sparseFALSE@featureLoadings)
[1] 7.684777e-09 5.845305e+00

PatternMarkers with threshold = 'all' again did not work on the CoGAPS object generated with sparseOptimization = TRUE, but it worked on the object generated without sparse optimization.
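
Comparing value ranges alone may not say much, since patterns from two runs can come back in a different order and on a different scale. A minimal sketch of matching patterns between two loading matrices by correlation follows; `loadA` and `loadB` are simulated stand-ins for `sparseTRUE@featureLoadings` and `sparseFALSE@featureLoadings`, and the permutation and scaling are invented for illustration.

```r
set.seed(1)

# Simulated loadings for run A: 200 genes, 3 patterns.
loadA <- matrix(runif(200 * 3), ncol = 3,
                dimnames = list(NULL, paste0("Pattern_", 1:3)))

# Run B: the same patterns, permuted, rescaled, with a little noise added.
loadB <- loadA[, c(3, 1, 2)] %*% diag(c(2, 0.5, 4)) +
  matrix(rnorm(200 * 3, sd = 0.01), ncol = 3)
colnames(loadB) <- paste0("Pattern_", 1:3)

# The ranges differ even though the underlying patterns are the same.
print(range(loadA))
print(range(loadB))

# Correlate every pattern in A with every pattern in B and pick the best match.
cors <- cor(loadA, loadB)
best_match <- apply(cors, 1, which.max)
print(best_match)
```

On real results, low best-match correlations would indicate that the two runs found genuinely different patterns rather than rescaled ones.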

ejfertig commented 7 months ago

Thanks! We will look into this and get back to you.

UPDATE @dimalvovs deleted quoted rows for readability

dimalvovs commented 6 months ago

ChiSq is not nan when we run simulated data with exactly the same dimensions and parameters:

c <- 380
r <- 27277
simdata <- matrix(runif(r*c), nrow=r, ncol=c)
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, distributed="genome-wide", sparseOptimization=TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(simdata, params = params, outputFrequency = 1000)

This is CoGAPS version 3.22.0 
Running genome-wide CoGAPS on simdata (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 3 is starting!
-- Equilibration Phase --
    worker 4 is starting!
    worker 2 is starting!
    worker 5 is starting!
1000 of 30000, Atoms: 28296(A), 1016(P), ChiSq: 171672640, Time: 00:00:55 / 01:33:09
2000 of 30000, Atoms: 31014(A), 983(P), ChiSq: 171320192, Time: 00:02:02 / 01:32:27
3000 of 30000, Atoms: 32194(A), 934(P), ChiSq: 171210704, Time: 00:03:07 / 01:28:59
4000 of 30000, Atoms: 33225(A), 922(P), ChiSq: 171167024, Time: 00:04:12 / 01:26:24

dimalvovs commented 6 months ago

Making data 50% sparse still runs fine. @rpalaganas what's the sparsity of your data?

c <- 380
r <- 27277
dense <- runif(r*c)
sparse <- sample(c(dense, rep(0, length(dense))),
                size = length(dense), replace = T)
sum(sparse==0)/length(sparse)
simdata <- matrix(dense, nrow=r, ncol=c)
params <- CogapsParams(nPatterns = 5, nIterations = 30000, seed = 42, distributed = "genome-wide", 
                       sparseOptimization = TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(simdata, params = params, outputFrequency = 1000)
This is CoGAPS version 3.22.0 
Running genome-wide CoGAPS on simdata (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
    worker 2 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
1000 of 30000, Atoms: 28519(A), 990(P), ChiSq: 171619760, Time: 00:00:56 / 01:34:51
2000 of 30000, Atoms: 30973(A), 967(P), ChiSq: 171261920, Time: 00:02:04 / 01:33:58
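
Note that in the snippet above `simdata` is built from `dense`, so the zero-mixed vector `sparse` never reaches CoGAPS. A minimal sketch of building the matrix from `sparse` and checking its sparsity in base R is below; the dimensions are reduced and the names `n_row`, `n_col` and `sparsity` are placeholders.

```r
set.seed(42)

# Reduced dimensions to keep the sketch quick (the run above used 27277 x 380).
n_row <- 2000
n_col <- 38

dense <- runif(n_row * n_col)

# Draw from a pool holding the dense values plus an equal number of zeros,
# which gives roughly 50% zeros in expectation.
sparse <- sample(c(dense, rep(0, length(dense))),
                 size = length(dense), replace = TRUE)

# Build the matrix from the zero-mixed vector, not from `dense`.
simdata <- matrix(sparse, nrow = n_row, ncol = n_col)

# Fraction of exact zeros in the matrix.
sparsity <- mean(simdata == 0)
print(sparsity)
```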

rpalaganas commented 6 months ago

The sparsity of the matrix that gave the -nans is 0.71. coop::sparsity(Hoxd10) #0.7125972

I also do not get -nan ChiSq when testing a matrix that is almost exactly as sparse.

x <- matrix(0, 380, 27277)
x[sample(length(x), size = round(0.29 * length(x)))] <- 1
coop::sparsity(x)  #0.7099992

params <- CogapsParams(nPatterns = 5, nIterations = 30000, seed = 42, distributed = "genome-wide", 
                       sparseOptimization = TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(t(x), params = params, outputFrequency = 1000)

This is CoGAPS version 3.21.5 
Running genome-wide CoGAPS on t(x) (27277 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (5455, 5455.4, 5457)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...    worker 5 is starting!
    worker 4 is starting!
    worker 2 is starting!
Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
    worker 3 is starting!
1000 of 30000, Atoms: 10004(A), 1354(P), ChiSq: 42395544, Time: 00:00:15 / 00:25:24
2000 of 30000, Atoms: 12855(A), 1780(P), ChiSq: 42189580, Time: 00:00:38 / 00:28:47

dimalvovs commented 6 months ago

I suspected that something was wrong with the distributions of some genes, and used this function to remove the genes that yield a nan ChiSq in the results.

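failRemoveRowsAll <- function(data){
  # Grow a window of rows i..j one gene at a time. Whenever adding row j
  # makes ChiSq nan, record that gene's name and drop it from the data.
  i <- 1
  j <- 6 #to keep genes > patterns 
  failnames <- c()
  params <- CogapsParams(nPatterns = 5, nIterations = 10, seed = 1,
                         sparseOptimization = TRUE)
  while (j <= nrow(data)) {
    # Short run (10 iterations) on the current window only
    res <- CoGAPS(data[c(i:j), ], params = params, outputFrequency = 10,
                  messages = FALSE)
    if (sum(is.na(res@metadata$chisq)) > 0) {
      # Row j broke the run: remember it and remove it; j now points at the next gene
      failname <- rownames(data)[j]
      message(failname, ", at: ", j, " fails: ", length(failnames))
      failnames <- c(failnames, failname)
      data <- data[-c(j),]
    } else {
      # Row j is fine: extend the window
      j <- j + 1
    }
  }
  return(failnames)
}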

Afterwards, ChiSq is no longer nan, but the value itself is huge:

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on hoxdata[!(rownames(hoxdata) %in% failed_by_all), ] (28470 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          10000 
seed                 1 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 10000, Atoms: 108128(A), 625(P), ChiSq: 1839920873269298576727893606400, Time: 00:01:01 / 00:30:39
2000 of 10000, Atoms: 115884(A), 588(P), ChiSq: 2292072881336368004474865713152, Time: 00:02:15 / 00:30:21
3000 of 10000, Atoms: 120147(A), 593(P), ChiSq: 2783612750416574213534292901888, Time: 00:03:29 / 00:29:30
4000 of 10000, Atoms: 123364(A), 572(P), ChiSq: 2550784517891173166148455759872, Time: 00:04:44 / 00:28:53
5000 of 10000, Atoms: 126268(A), 568(P), ChiSq: 2541503896605446561631530123264, Time: 00:06:01 / 00:28:30
6000 of 10000, Atoms: 126646(A), 563(P), ChiSq: 2905690684144169274910709907456, Time: 00:07:20 / 00:28:16
7000 of 10000, Atoms: 126687(A), 555(P), ChiSq: 2674804590947979129294038237184, Time: 00:08:38 / 00:27:57
8000 of 10000, Atoms: 126574(A), 561(P), ChiSq: 3147559868411558326579587186688, Time: 00:09:56 / 00:27:41
9000 of 10000, Atoms: 126684(A), 553(P), ChiSq: 2698749482425631185700149788672, Time: 00:11:13 / 00:27:23
10000 of 10000, Atoms: 126485(A), 553(P), ChiSq: 2957405810624188978088997027840, Time: 00:12:30 / 00:27:06
-- Sampling Phase --
1000 of 10000, Atoms: 126720(A), 556(P), ChiSq: 2646788641772774809142103638016, Time: 00:13:47 / 00:26:51
2000 of 10000, Atoms: 126701(A), 559(P), ChiSq: 2966075017676645483900815015936, Time: 00:15:05 / 00:26:40
3000 of 10000, Atoms: 126748(A), 561(P), ChiSq: 2824885175666762749901468073984, Time: 00:16:23 / 00:26:29
4000 of 10000, Atoms: 126825(A), 548(P), ChiSq: 2910313314246920713217492647936, Time: 00:17:41 / 00:26:19
5000 of 10000, Atoms: 126857(A), 542(P), ChiSq: 3040585044948408827482774437888, Time: 00:18:58 / 00:26:08
6000 of 10000, Atoms: 126800(A), 551(P), ChiSq: 3114951210047638030192943824896, Time: 00:20:16 / 00:25:59
7000 of 10000, Atoms: 126686(A), 548(P), ChiSq: 2923385429134413698483590529024, Time: 00:21:33 / 00:25:49
8000 of 10000, Atoms: 126779(A), 559(P), ChiSq: 2855560157182209446923168907264, Time: 00:22:51 / 00:25:41
9000 of 10000, Atoms: 126407(A), 551(P), ChiSq: 2890611147933206197900012421120, Time: 00:24:09 / 00:25:34
10000 of 10000, Atoms: 126802(A), 552(P), ChiSq: 2753871361865324913892758913024, Time: 00:25:26 / 00:25:26

compared to results on the same data with sparseOptimization = FALSE:

-- Equilibration Phase --
10000 of 10000, Atoms: 57241(A), 3038(P), ChiSq: 26866870, Time: 00:29:54 / 01:04:51

Is the sparse sampler unable to find a proper solution? Also, is it normal for ChiSq to increase over iterations?
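
To make the quantity under discussion concrete, here is a minimal sketch that assumes ChiSq is the usual uncertainty-weighted sum of squared residuals between the data D and the product A %*% P. The exact form used inside CoGAPS and the uncertainty matrix `S` (taken here as 10% of the data with a floor of 0.1) are assumptions, and all data are simulated.

```r
set.seed(1)
n_genes <- 50
n_samples <- 20
n_patterns <- 3

# Simulated non-negative factors and data with a small positive perturbation.
A <- matrix(rexp(n_genes * n_patterns), n_genes, n_patterns)
P <- matrix(rexp(n_patterns * n_samples), n_patterns, n_samples)
D <- A %*% P + matrix(abs(rnorm(n_genes * n_samples, sd = 0.1)),
                      n_genes, n_samples)

# Assumed uncertainty: 10% of the data, floored at 0.1 (an assumption).
S <- pmax(0.1 * D, 0.1)

# Uncertainty-weighted sum of squared residuals.
chisq <- function(D, A, P, S) sum(((D - A %*% P) / S)^2)

good_fit <- chisq(D, A, P, S)
bad_fit <- chisq(D, 2 * A, P, S)  # deliberately wrong scale for A
print(c(good = good_fit, bad = bad_fit))
```

Under this definition a stochastic sampler is not guaranteed to lower the value at every step, so small fluctuations between reported iterations would not be surprising on their own; a sustained climb to values of this magnitude is a different matter.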

jeanettejohnson commented 6 months ago

Hey all, a few notes regarding this issue. @rpalaganas, since the run is distributed, the P atom count being 0 in the second phase is correct: at that point one matrix has been fixed and CoGAPS is learning the cognate (A) matrix.

jeanettejohnson commented 6 months ago

Is this data more than 80% sparse?

rpalaganas commented 6 months ago

Is this data more than 80% sparse?

Slightly less, ~71% sparse

dimalvovs commented 6 months ago

So, we have technically addressed all the points raised in the issue report:

  1. ChiSq value was -nan: the -nans appear because the actual ChiSq values are too large, as can be confirmed here.
  2. During the equilibration phase, the P matrix was 0: this is expected, because in the distributed run one of the matrices is fixed (see comment above).
  3. SparseOptimization = TRUE gave 5 patterns while SparseOptimization = FALSE gave 6 patterns: in distributed mode the number of patterns returned can differ from the number requested, because the returned patterns are a superset of the patterns matched across nSets, controlled by the maxNS parameter. We may want to set maxNS to nPatterns by default to avoid this confusion.
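
On point 3, the minNS and maxNS values printed in the logs of this thread (nSets = 5 gives 3 and 8; nSets = 6 gives 3 and 9) are consistent with the rule sketched below. The rule is inferred from those logs only, not taken from the CoGAPS source, and `default_ns` is a hypothetical helper.

```r
# Rule inferred from the logs in this thread (an assumption, not the CoGAPS source):
# minNS = ceiling(nSets / 2), maxNS = minNS + nSets.
default_ns <- function(nSets) {
  minNS <- ceiling(nSets / 2)
  maxNS <- minNS + nSets
  c(minNS = minNS, maxNS = maxNS)
}

print(default_ns(5))
print(default_ns(6))

# With CoGAPS loaded, maxNS could be set explicitly instead of relying on
# the default (not run here; assumes setDistributedParams accepts maxNS):
# params <- setDistributedParams(params, nSets = 5, maxNS = 5)
```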

The unsolved problem motivated by this issue is why ChiSq is so large for this dataset compared to a simulated dataset with similar dimensions and sparsity, as demonstrated here.
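
As a generic illustration of how "too large" can turn into nan in floating-point arithmetic, the sketch below uses R doubles. Where exactly the overflow occurs inside CoGAPS, and which floating-point width is involved, is not established in this thread, so treat this as an analogy only.

```r
# Largest finite double on this machine.
big <- .Machine$double.xmax

# Exceeding it overflows to Inf.
overflowed <- big * 10
print(overflowed)

# A sum of squares can overflow the same way.
sum_sq <- sum(c(1e200, 1e200)^2)
print(sum_sq)

# Once Inf is in play, differences such as Inf - Inf evaluate to NaN.
diff_inf <- overflowed - sum_sq
print(diff_inf)
```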

dimalvovs commented 5 months ago

Interestingly, the bootstrapped version of the original data also fails:


> resamp <- sample(hoxdata, size = length(hoxdata), replace = T)
> resamp <- matrix(resamp, ncol = 380)
> params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE, distributed="genome-wide")
> params <- setDistributedParams(params, nSets=5)
setting distributed parameters - call this again if you change nPatterns
> res <- CoGAPS(resamp, params)

This is CoGAPS version 3.23.1 
Running genome-wide CoGAPS on resamp (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          5 
cut            5 
minNS          3 
maxNS          8 

Creating subsets...
set sizes (min, mean, max): (6081, 6081.4, 6083)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
    worker 3 is starting!
    worker 4 is starting!
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
    worker 5 is starting!
    worker 2 is starting!
1000 of 30000, Atoms: 24149(A), 137(P), ChiSq: nan, Time: 00:00:15 / 00:25:24

dimalvovs commented 5 months ago

Also sparseOptimization=TRUE fails for non-distributed mode:

#sparse optimized and not distributed
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE)
params <- setDistributedParams(params, nSets=5)
res <- CoGAPS(hoxdata, params)

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on hoxdata (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 30000, Atoms: 86929(A), 369(P), ChiSq: nan, Time: 00:00:38 / 01:04:21
2000 of 30000, Atoms: 99716(A), 375(P), ChiSq: nan, Time: 00:01:29 / 01:07:26
3000 of 30000, Atoms: 103343(A), 386(P), ChiSq: nan, Time: 00:02:23 / 01:08:03

dimalvovs commented 5 months ago

Interestingly, sampling from a histogram does not fail:

#sample from histogram of data
hox_hist <- hist(hoxdata, breaks = 100, plot = FALSE)

hox_sim <- sample(hox_hist$mids, size = length(hoxdata),
  replace = T, prob = hox_hist$density)
hox_sim <- matrix(jitter(hox_sim), ncol = 380)

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, sparseOptimization=TRUE)
res <- CoGAPS(hox_sim, params)

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on hox_sim (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
1000 of 30000, Atoms: 73945(A), 1873(P), ChiSq: 221560368, Time: 00:05:48 / 09:49:26
2000 of 30000, Atoms: 83256(A), 1960(P), ChiSq: 220883600, Time: 00:11:54 / 09:01:04
3000 of 30000, Atoms: 94329(A), 1951(P), ChiSq: 220712496, Time: 00:19:01 / 09:03:02
4000 of 30000, Atoms: 103541(A), 1927(P), ChiSq: 220577920, Time: 00:26:22 / 09:02:24
5000 of 30000, Atoms: 111937(A), 1922(P), ChiSq: 220456768, Time: 01:02:02 / 16:30:34
^C