LLNL / hiop

HPC solver for nonlinear optimization problems
Other
210 stars 42 forks source link

Checkpointing (quasi-Newton solver) #693

Closed cnpetra closed 1 month ago

cnpetra commented 2 months ago

Adds API for checkpointing and support for restart via user options.

closes #686

tepperly commented 1 month ago

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore.

You can get the DataStore from the Group in case there is a need for that.

IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

cnpetra commented 1 month ago

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore.

You can get the DataStore from the Group in case there is a need for that.

IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

I take it that load_state_from_data_store() and save_state_to_data_store() "APIs" would do the job on your end. I see the flexibility of working with the sidre::group and will make the change. Will have you as a reviewer to have a final look when the PR ready.

tepperly commented 1 month ago

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore. You can get the DataStore from the Group in case there is a need for that. IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

I take it that load_state_from_data_store() and save_state_to_data_store() "APIs" would do the job on your end. I see the flexibility of working with the sidre::group and will make the change. Will have you as a reviewer to have a final look when the PR ready.

The load/save methods offer the most flexibility for LiDO and its clients. I haven't double checked that we have the right HiOp object handle to call them.

cnpetra commented 1 month ago

I thought about it some more over lunch, and I think load_state_from_data_store() and save_state_to_data_store() should take a ::axom::sidre::Group & as it's argument rather than a ::axom::sidre::DataStore &. The Group argument should specify the group where the save/load should take place from rather than it being a fixed location relative to the root DataStore. You can get the DataStore from the Group in case there is a need for that. IMHO, this is an improvement because if HiOp is part of a complex program, the outer program may want to encapsulate the HiOp part of the checkpoint in some group that's potentially several levels down from the DataStore root group. It also leaves the door open for their to be multiple instances of HiOp active in the running program that all need to be checkpointed. The current design can only support checkpointing of one HiOp instance in a given DataStore.

I take it that load_state_from_data_store() and save_state_to_data_store() "APIs" would do the job on your end. I see the flexibility of working with the sidre::group and will make the change. Will have you as a reviewer to have a final look when the PR ready.

The load/save methods offer the most flexibility for LiDO and its clients. I haven't double checked that we have the right HiOp object handle to call them.

LiDO for sure instantiates the algorithm class somewhere in there. That "algorithm" object should be used. For example, attach it to the interface. Will provide an example in this PR.

cnpetra commented 1 month ago

@tepperly please take a look here and here at the use of the load/save API

tepperly commented 1 month ago

I'll try to look at it early this week. It's PA season.

On Sun, Sep 22, 2024, 5:14 PM Cosmin G Petra @.***> wrote:

@cnpetra https://github.com/cnpetra requested your review on: #693 https://github.com/LLNL/hiop/pull/693 Checkpointing (quasi-Newton solver).

— Reply to this email directly, view it on GitHub https://github.com/LLNL/hiop/pull/693#event-14355934431, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACM3KAWWTXVWUBXXPWCVWLZX5MNFAVCNFSM6AAAAABNI3FQ5WVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJUGM2TKOJTGQ2DGMI . You are receiving this because your review was requested.Message ID: @.***>

tepperly commented 1 month ago

@tepperly please take a look here and here at the use of the load/save API

I looked over it again, and it looks good to me.