donaldRwilliams / chkptstanr

Checkpoint Stan R
https://donaldrwilliams.github.io/chkptstanr/

Feature Request: Define breaks for stopping #4

Open · peclayson opened this issue 2 years ago

peclayson commented 2 years ago

I've been playing around with the package a bit. It's really cool!

I read the documentation for chkpt_brms, and I was wondering whether it would be possible in a future release to define actual breaks programmatically (so a user doesn't have to rely on the 'stop' button). I'm hoping for a way to circumvent the need for a user to interact with the fitting.

For example, with iter_warmup = 5000, iter_sampling = 15000, and iter_perchkpt = 1000, a separate argument could force the fitting to stop every 5,000 iterations, and a later call to chkpt_brms could then pick up where the previous one left off.
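To make the arithmetic in this example concrete, here is a minimal Python sketch that splits the 20,000 total iterations (warmup plus sampling) into jobs that each end at a forced break. It is not part of chkptstanr: `break_schedule` and `stop_every` are hypothetical names, and it assumes a break can only fall on a checkpoint boundary, so `stop_every` must be a multiple of `iter_perchkpt`.

```python
def break_schedule(iter_warmup, iter_sampling, iter_perchkpt, stop_every):
    """Return one (start, end) iteration range per job, each ending at a forced break.

    Hypothetical helper: assumes state is only saved at checkpoints, so a
    break must land on a multiple of iter_perchkpt.
    """
    if stop_every % iter_perchkpt != 0:
        raise ValueError("stop_every must be a multiple of iter_perchkpt")
    total = iter_warmup + iter_sampling
    jobs = []
    start = 0
    while start < total:
        end = min(start + stop_every, total)  # last job may be shorter
        jobs.append((start, end))
        start = end
    return jobs


jobs = break_schedule(5000, 15000, 1000, 5000)
```

With the numbers from the request this yields four jobs of five checkpoints each: the first covers exactly the 5,000 warmup iterations and the other three each cover 5,000 sampling iterations.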

The application I am thinking about is running models on a computer cluster, rather than a desktop. My hope is to force breaks to split up long jobs so they can be run on nodes with shorter wall times.

Thanks, Peter

donaldRwilliams commented 2 years ago

Hey!

I think that should be possible, but will have to think a bit about how to implement.

In RStudio, there is a way to schedule running a .R file. So if your script calls chkpt_brms, I don't think you would have to interact with it (I'm pretty sure this will work, as this is the use case we had in mind).

Let me think about this a bit more!

peclayson commented 2 years ago

I don't see any issue using chkpt_brms on the cluster (I've only used it on my desktop so far). I plan to try it out after the semester is over. It will be helpful for saving time after node failures... :)

My hope is that if I have the break built in, once chkpt_brms gets to the breaking point, the function finishes, and then it moves on through the script to queue up another job on the cluster to pick up the baton.

Although it's possible to pick up where the job left off by submitting another job manually, I would like to automate queuing up the next job. If the job is killed for reaching the maximum wall time, the script never reaches the code that passes the baton.
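The baton-passing pattern described above can be sketched as follows. This is a minimal Python simulation, not chkptstanr code: `run_one_job`, the JSON state file, and the `submit_next` callback are hypothetical stand-ins (on a real cluster `submit_next` might call the scheduler, e.g. Slurm's `sbatch`, which is an assumption here). Each job resumes from the saved state, fits checkpoints until it reaches its forced break, and only then queues the next job. Because the job stops at a break instead of being killed at the wall-time limit, the resubmission line is always reached.

```python
import json
import os
import tempfile


def run_one_job(state_path, total_iter, iter_perchkpt, stop_every, submit_next):
    """Resume from saved state, run up to the next forced break, then pass the baton."""
    done = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)["done"]
    target = min(done + stop_every, total_iter)  # this job's break point
    while done < target:
        # Placeholder for fitting one checkpoint of iter_perchkpt iterations.
        done = min(done + iter_perchkpt, target)
        with open(state_path, "w") as f:
            json.dump({"done": done}, f)  # state saved after every checkpoint
    if done < total_iter:
        submit_next()  # hypothetical hook: queue the follow-up job
    return done


# Simulate a chain of jobs; the "queue" stands in for the cluster scheduler.
state = os.path.join(tempfile.mkdtemp(), "state.json")
queue = []
progress = [run_one_job(state, 20000, 1000, 5000, lambda: queue.append("next"))]
while queue:
    queue.pop()
    progress.append(run_one_job(state, 20000, 1000, 5000, lambda: queue.append("next")))
```

Saving after every checkpoint (not only at the break) keeps the node-failure benefit: a crashed job loses at most one checkpoint of work.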

Thanks, Donald!

donaldRwilliams commented 2 years ago

> My hope is that if I have the break built in, once chkpt_brms gets to the breaking point, the function finishes, and then it moves on through the script to queue up another job on the cluster to pick up the baton.

I see! Let me think about how best to implement this, and I will update here with some ideas. Of course, I'm open to any ideas you have about how to implement that in the package.

venpopov commented 8 months ago

I implemented this in an open pull request and only then saw that this feature had already been requested here:

https://github.com/donaldRwilliams/chkptstanr/pull/14