Trying to run 3-state (2 spot states) HMM data - getting CUDA memory error and "Iteration started with a new seed" warnings

zhoudan-brandeis commented 1 year ago

Other details: I am running v1.1.17 and have previously run the same data successfully with a 2-state (1 spot state) HMM model.

I reduced the spot and frame batch sizes (from 10 to 5 and from 512 to 256, respectively) and still get many iterations of the warning (hundreds! The example image only shows the last few) before ultimately running out of CUDA memory.

ordabayevy commented 1 year ago

The program restarts the run when NaN values are detected in the parameters. It is usually OK if this happens a small number of times during the entire run.

If it happens repeatedly, like in your case, then something is pathological. It is hard to tell whether it is related to the data or the model without inspecting it closely. Can we set up a Zoom meeting to have a closer look at this together?
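As a rough illustration of the restart behaviour described above, here is a minimal sketch in plain Python. It is not Tapqir's actual training loop: `run_with_restarts`, `step`, and `max_restarts` are hypothetical names, and the restart cap is an assumption added for the example. The thread only says that the run restarts with a new seed when NaN values appear.

```python
import math
import random


def run_with_restarts(step, init_params, num_iters, max_restarts=5, seed=0):
    """Run `step` for `num_iters` iterations; on NaN, reseed and start over.

    Hypothetical helper for illustration only, not Tapqir's implementation.
    Returns the final parameters and the number of restarts that were needed.
    """
    restarts = 0
    while True:
        rng = random.Random(seed)      # fresh random state for this attempt
        params = dict(init_params)     # always restart from the initial values
        for _ in range(num_iters):
            params = step(params, rng)
            if any(math.isnan(v) for v in params.values()):
                break                  # NaN detected: abandon this attempt
        else:
            return params, restarts    # finished all iterations without NaN
        restarts += 1
        if restarts > max_restarts:
            # Repeated failures suggest a real problem, not bad luck.
            raise RuntimeError("NaN values persisted after repeated restarts")
        seed += 1                      # "iteration started with a new seed"
```

A handful of restarts in a long run would pass through this loop unnoticed, while hundreds of them point to a problem that reseeding cannot fix.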

ordabayevy commented 1 year ago

I also see that it has run 50800 iterations. How close is it to convergence when you look at TensorBoard?

zhoudan-brandeis commented 1 year ago

I found a workaround that I think may give you a clue: I made a new directory and put the same data in it (driftlist, header, on/off spots). Since there wasn't a .tapqir folder there, it seems to be running smoothly (10% and counting).

ordabayevy commented 1 year ago

Oh, I guess that is the reason. The model file has the same name for the 2-state and 3-state HMM models. Since you have already run the 2-state model, its file is saved in the .tapqir folder. When you then try to run the 3-state HMM, the program loads the 2-state model file and tries to continue from there, which is why it reports iteration 50800. Running it in a different analysis folder should fix the problem.
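For anyone who wants to script the workaround, the sketch below copies an analysis folder to a new location while leaving out the `.tapqir` folder, so the new run has no saved model file to resume from. The function name `make_fresh_analysis_dir` and the example paths are hypothetical. Only the `.tapqir` folder name comes from this thread, and the sketch assumes that folder is the only place the saved state lives.

```python
import shutil
from pathlib import Path


def make_fresh_analysis_dir(src, dst):
    """Copy the data in `src` to a new folder `dst`, skipping `.tapqir`.

    Hypothetical helper for illustration; it is not part of Tapqir.
    `dst` must not exist yet, otherwise shutil.copytree raises FileExistsError.
    """
    src, dst = Path(src), Path(dst)
    # ignore_patterns skips every entry named ".tapqir" while copying,
    # so no previously saved model file ends up in the new folder.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(".tapqir"))
    return dst
```

For example, `make_fresh_analysis_dir("analysis_2state", "analysis_3state")` (both paths made up) would leave the original folder and its saved 2-state run untouched, and the 3-state model can then be run from the new folder.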

gelles-brandeis / tapqir, issue #420