FertigLab / CoGAPS

Bayesian MCMC matrix factorization algorithm
https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html
BSD 3-Clause "New" or "Revised" License

CoGAPS does not learn the specified nPatterns when running in distributed mode #100

Closed dtatarak closed 2 months ago

dtatarak commented 4 months ago

I am running CoGAPS on a small single-cell data set: 11623 genes x 900 cells. I have noticed that when I run CoGAPS in distributed mode, it does not produce the number of patterns I specified in nPatterns. Here are the full params stored in the result object:

cogapsresult@metadata$params
-- Standard Parameters --
nPatterns            6 
nIterations          500 
seed                 1234 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          7 
cut            6 
minNS          4 
maxNS          11 

However, as you can see, only 4 patterns were learned:

cogapsresult
[1] "CogapsResult object with 11623 features and 900 samples"
[1] "4 patterns were learned"

Now, if I do not run in distributed mode, it takes longer, but I get the number of patterns I asked for. Here are the parameters for this run:

cogapsresult@metadata$params
-- Standard Parameters --
nPatterns            6 
nIterations          500 
seed                 1234 
sparseOptimization   TRUE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

And the object itself:

cogapsresult
[1] "CogapsResult object with 11623 features and 900 samples"
[1] "6 patterns were learned"

I don't know why this is happening. I assumed I was overwriting some parameters when I created the distributed params object, but as you can see, the intended number of patterns is indeed being passed on to the CoGAPS function.

This data set is small, so I can afford to run in standard mode, but that is not scalable without the ability to run distributed and still generate the intended number of patterns. Could you please help me understand what's going on here? I'm hoping there's something simple I'm overlooking. Thanks!

ejfertig commented 4 months ago

This occurs because there’s a consensus step that seeks common patterns between the random sets used for parallel analysis. This can happen when one of the sets contains a pattern that isn’t correlated with another, and is therefore added. It can indicate you need a higher number of dimensions to capture the variation in your data.
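
To make the consensus idea concrete, here is a toy sketch in Python. It is not CoGAPS code and does not reproduce its matching algorithm. It assumes that each set's patterns can be reduced to a label (a stand-in for grouping correlated patterns) and that a consensus pattern is kept only when at least `min_ns` sets contribute to it, which is my reading of the minNS parameter shown above. The function name and the example labels are hypothetical.

```python
from collections import Counter


def consensus_patterns(set_patterns, min_ns):
    """Toy consensus step: keep labels that are supported by at least min_ns sets.

    set_patterns is a list with one entry per set; each entry lists the labels
    of the patterns learned on that set. Labels stand in for clusters of
    correlated patterns (an assumption made for illustration only).
    """
    # Count how many sets contain each label; a set counts once per label.
    support = Counter(
        label for patterns in set_patterns for label in set(patterns)
    )
    # Keep only labels with enough support across the sets.
    return sorted(label for label, n in support.items() if n >= min_ns)


# Hypothetical example: 7 sets, each asked for 6 patterns.
# A-D appear in every set; E, F, G and the x* labels appear in only a few.
set_patterns = [
    ["A", "B", "C", "D", "E", "x1"],
    ["A", "B", "C", "D", "F", "x2"],
    ["A", "B", "C", "D", "E", "x3"],
    ["A", "B", "C", "D", "G", "x4"],
    ["A", "B", "C", "D", "F", "x5"],
    ["A", "B", "C", "D", "E", "x6"],
    ["A", "B", "C", "D", "G", "x7"],
]

kept = consensus_patterns(set_patterns, min_ns=4)
print(len(kept), kept)
```

In this toy setup every set learns 6 patterns, but only the labels shared by enough sets survive, so the final count is set by the agreement between sets rather than by the number requested per set.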

dtatarak commented 4 months ago

Ok that makes sense. So that being the case, it sounds like while using distributed mode, it isn't possible to enforce a hard number of patterns in the final result. Is that correct?

Would you expect this behavior to be different between genome-wide and single-cell modes of distribution?

Thanks very much for the information!

ejfertig commented 4 months ago

Correct - it can’t be locked in unless you manually fix the number of patterns in the pattern matching step. This would happen for both genome-wide and single-cell, but you may get different results, as one parallelizes across the genes and the other across the cells.

dimalvovs commented 2 months ago

Closing as answered.