AllenNeuralDynamics / dynamic-foraging-task

Bonsai/Harp workflow for Dynamic Foraging with Python GUI for visualization and control
MIT License

Saving checkpoints #484

Closed yonibrowning closed 3 months ago

yonibrowning commented 4 months ago

Pull Request instructions:

Describe changes:

Saving now runs continuously during the task. I have not yet removed the old saving capability, but I have confirmed that I can reconstruct the old-style data in its entirety.

At the beginning of the task, the initial settings are logged. At the end of each trial, the last step is now to write three additional files: (1) trial data ('TP'-encoded variables), (2) state data ('B'- and 'BS_'-encoded variables), and (3) settings data (all other previously logged variables). Trial data has one entry per variable per trial. In all other cases, data are logged only if they changed since the last trial. For arrays/lists that change size, only the appended values are logged. Thus, all files (0 through N) are needed to reconstruct the task state at trial N.
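For reference, here is a minimal, hypothetical sketch of the delta-logging scheme described above. The function name, file naming, and JSON layout (`write_checkpoint`, `checkpoint_00001.json`, the `"appended"`/`"value"` keys) are illustrative assumptions, not the actual code in this PR.

```python
# Hypothetical sketch of per-trial delta logging; names and schema are made up.
import json
import os


def write_checkpoint(trial_index, current_state, previous_state, out_dir):
    """Write one checkpoint file containing only values that changed
    (or list elements that were appended) since the previous trial."""
    delta = {}
    for key, value in current_state.items():
        old = previous_state.get(key)
        if isinstance(value, list) and isinstance(old, list) and value[:len(old)] == old:
            # List grew: log only the newly appended elements.
            if len(value) > len(old):
                delta[key] = {"appended": value[len(old):]}
        elif value != old:
            # Scalar (or replaced) value: log only if it changed.
            delta[key] = {"value": value}

    path = os.path.join(out_dir, f"checkpoint_{trial_index:05d}.json")
    with open(path, "w") as f:
        json.dump(delta, f)
    return path
```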

I also added loading functionality: if the .json is in a "per_trial_logging" folder, the load feature will now assume this new data structure and load the data accordingly. Behavior is otherwise unchanged from loading an old-style file.
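A matching sketch of how loading could replay those per-trial files to reconstruct the state at trial N. Again, the names (`reconstruct_state`, the glob pattern) are invented for illustration and only assume the delta format sketched above.

```python
# Hypothetical reconstruction of the full task state from checkpoint files 0..N.
import glob
import json
import os


def reconstruct_state(per_trial_dir):
    """Replay checkpoints in order, applying value overwrites and list appends."""
    state = {}
    for path in sorted(glob.glob(os.path.join(per_trial_dir, "checkpoint_*.json"))):
        with open(path) as f:
            delta = json.load(f)
        for key, entry in delta.items():
            if "appended" in entry:
                state.setdefault(key, []).extend(entry["appended"])
            else:
                state[key] = entry["value"]
    return state
```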

What issues or discussions does this update address?

Crashes will no longer result in the loss of an entire session. In the event of a crash, only the last trial's data will be lost, and the task can be restored to the state immediately before the crash.

Describe the expected change in behavior from the perspective of the experimenter

Currently none, with the following caveats:

(1) In the longer term, this method could replace the need to save a monolithic JSON at the end of the session, but at present that code is still in place. (2) Right now, less than 10-20 kB of data are logged per trial (in contrast to the multi-megabyte full log). This is most likely small enough that we can avoid designating a separate thread for logging, but if task timing is affected in any way we should consider moving the logging onto its own thread.

Describe any manual update steps for task computers

Should be none.

Was this update tested in 446/447?

No. Someone should test in a variety of situations (opto, ophys, plain behavior w/ weights, etc.) to confirm that everything is working. I have included code to confirm that "old-style" logs can be recapitulated exactly, so we should be able to do this testing without risk to anyone's workflows/data.

XX-Yin commented 4 months ago

Why do we need to update the bonsai files?

alexpiet commented 4 months ago

I added the pull request instructions back into the PR body, since they are required for my update email to parse correctly

alexpiet commented 4 months ago

If a session crashes and we do not have the full session format data, is the intent to just use the checkpoint format? Or will there be a process to convert the checkpoint format into a full session format?

If we keep the full session format, then we need to update how the auto-train process works, the quick load process, and the nwb generation.

XX-Yin commented 4 months ago

@yonibrowning @alexpiet I think I made a version with minimal changes and the save is running in a different thread https://github.com/AllenNeuralDynamics/dynamic-foraging-task/pull/485

yonibrowning commented 4 months ago

@XX-Yin @alexpiet The changes to the Bonsai workflow are unintentional; I was having trouble with the firewall on my computer blocking video.

yonibrowning commented 4 months ago

@XX-Yin The "best" solution is probably the following: I think we should keep the lightweight checkpoint files but consider saving them on a separate thread. The problem remains that the existing save format rewrites a file that grows over the session (several megabytes by the end), so the potential for problems increases with session length. Threading helps but doesn't eliminate this; saving small amounts of data each trial is always going to be safer.
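To make the concern concrete, here is a hedged illustration of the two write patterns being discussed. The file names and record structure are invented for this sketch; the point is only that the full-session rewrite grows with trial count while per-trial appends stay roughly constant.

```python
# Illustrative comparison of the two write patterns (not the GUI's actual code).
import json


def save_full_session(session_dict, path="session.json"):
    # Existing approach: the whole (growing) session is rewritten every time,
    # so the amount of data written per trial grows with session length.
    with open(path, "w") as f:
        json.dump(session_dict, f)


def append_trial_record(trial_dict, path="per_trial_log.jsonl"):
    # Checkpoint-style approach: each trial appends a small record, so the
    # per-trial write cost stays roughly constant regardless of session length.
    with open(path, "a") as f:
        f.write(json.dumps(trial_dict) + "\n")
```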

XX-Yin commented 4 months ago

@yonibrowning @alexpiet

I don't think saving time is an issue. I measured how saving time changes with trial number (shown below). We do see an increase in saving time with trial number, but it only took 0.35 s to save a session with 2000 trials, including generating the session and rig metadata. We can't run 2000 trials, and the duration of each trial is much longer than 0.35 s.

[Plot: saving time vs. trial number]
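For reference, a rough, hypothetical way to reproduce this kind of save-time-vs-trial-count measurement. The stand-in session dictionary and key names are invented; the real GUI's save path and session object differ.

```python
# Stand-in benchmark of full-session JSON save time as trial count grows.
import json
import time


def time_full_save(n_trials):
    # Fake session data roughly shaped like per-trial logs.
    session = {"TP_reward": list(range(n_trials)),
               "B_licks": list(range(n_trials * 10))}
    t0 = time.perf_counter()
    with open("session.json", "w") as f:
        json.dump(session, f)
    return time.perf_counter() - t0


for n in (100, 500, 1000, 2000):
    print(n, f"{time_full_save(n):.4f} s")
```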

alexpiet commented 4 months ago

@XX-Yin Interesting data. Was this rig also collecting behavior video? If the computer is also logging video, then write times to the hard drive are going to be the bottleneck. It's going to be better, both for the reasons we can measure now and because of unknown consequences, to only write each additional trial to file.

But the checkpoint file format requires updates to other parts of the code. So I think we should just "rebuild" the checkpoint files into a full session file at the end of the session, or after a crash.

XX-Yin commented 4 months ago

@alexpiet The video data is about 80 GB/h for 4 cameras, and the write speed of a standard SSD is about 1828 GB/h, so video uses only roughly 4% of the drive's sustained write bandwidth. I predict that we will not see much increase in saving time with video, but I will test it anyway.

yonibrowning commented 4 months ago

I think the checkpoint files will ultimately be the safer option; I worry that it will be very hard to detect intermittent saving issues until they happen.

I want to make sure that I am understanding the reservations here. @XX-Yin: your concern is that checkpoint files change the saving method and so (1) might interfere with existing code that uses the saved data and (2) might increase maintenance effort/tech debt by requiring two modes of saving to be maintained?

I would like to suggest the following to address these concerns: (1) We do not change the existing saving method. As @alexpiet suggests, we only touch the checkpoint files in case of emergency. Once we have rigorously tested the checkpointing method, we can even consider having the main save function delete these files, so that they are only retained in the case of a crash. (2) We rework the checkpoint code so that it shares common data-getting functions with the end-of-session save functionality. This would ensure that any changes made to the core saving functionality automatically propagate to the checkpoint code. (3) We bump the checkpointing onto its own thread, so it can't interfere with the task loop if something goes wrong there; a minimal sketch of what this could look like follows below.
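A minimal sketch of point (3), assuming a simple queue-plus-worker-thread design. The class and method names (`CheckpointWriter`, `submit`, `close`) are hypothetical, not the GUI's actual implementation.

```python
# Sketch: push checkpoint writes onto their own thread via a queue so the
# task loop never blocks on disk I/O.
import json
import os
import queue
import threading


class CheckpointWriter:
    def __init__(self, out_dir="per_trial_logging"):
        os.makedirs(out_dir, exist_ok=True)
        self.out_dir = out_dir
        self.queue = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, trial_index, delta):
        # Called from the task loop; returns immediately.
        self.queue.put((trial_index, delta))

    def _run(self):
        while True:
            trial_index, delta = self.queue.get()
            if trial_index is None:  # sentinel used by close()
                break
            path = os.path.join(self.out_dir, f"checkpoint_{trial_index:05d}.json")
            with open(path, "w") as f:
                json.dump(delta, f)

    def close(self):
        # Flush remaining writes before the session ends.
        self.queue.put((None, None))
        self.worker.join()
```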

Pending consensus, I can forage ahead (pun intended) with incorporating this + feedback into this PR?

Sorry for being so persistent on this- I had huge problems with data logging interfering with smooth task running in grad school. Turns out monkeys are graphics snobs who won't tolerate dropped frames... anyway, I realize there is no visual stimulus here so things are going to be less temperamental, but I would like to take this opportunity to future-proof this setup, so we don't run into problems later!

yonibrowning commented 4 months ago

Vaguely related:

We just did some profiling on ephys rig 2. With 4 cameras running at high speed, the disk write load is very low. CPU and GPU usage is very high, however, so those could be the current optimization bottlenecks. Disk writing would still benefit from smaller writes, but this may really come down to a video optimization problem, particularly as we scale the number of cameras.

XX-Yin commented 4 months ago

@yonibrowning I think you are overly concerned about the saving time issue. The JSON file is less than 2 MB, and it takes less than 0.25 s to save at 1000 trials with 4 cameras running at 500 Hz (the measured saving times with four cameras are shown below). The minimum duration of each trial is 4 s, so it is very unlikely that saving takes more than 4 s and crashes a trial. If saving were a problem, I would expect saving single trials to have the same problem as saving the entire session, given that the entire session is only about 2 MB.

I am concerned about the complexity and redundancy of the current code: 1) you need to combine the per-trial files; 2) you also need to generate the session and rig metadata afterwards; 3) the code is redundant with the existing saving methods, which makes mistakes easier.

[Plot: saving time vs. trial number, with four cameras recording]

XX-Yin commented 4 months ago

Another session with four cameras

[Plot: saving time vs. trial number, second session with four cameras]

yonibrowning commented 4 months ago

I think the disagreement between this and https://github.com/AllenNeuralDynamics/dynamic-foraging-task/pull/485 boils down to this: I understand that @XX-Yin is concerned about the added effort needed to maintain this "checkpoint" solution, but I think that maintenance effort can be minimized through better functionalization of the saving code, and that unnecessary writing-to-disk is worth some up-front effort to avoid.

At the end of the day, though, it's more important to have a functioning solution than a perfect one. Here is my vote: let's merge https://github.com/AllenNeuralDynamics/dynamic-foraging-task/pull/485 for now so that ephys rigs can stay up to date while having some backup, but make a note to discuss with the rest of the group, when there is a chance, which method to keep in the long term.

alexpiet commented 3 months ago

I propose closing this PR since we merged in #485