StormSurgeLive / asgs

The Automated Solution Generation System (ASGS) provides software infrastructure for automating coastal ocean modelling for real time decision support, and provides a variety of standalone command line tools for pre- and post-processing. Visit us at https://discord.gg/jFbacxrUf9
https://tools.adcirc.live
GNU General Public License v3.0

RFC: "failed directories" should be a symlink, keep original directory in place #1323

Open wwlwpd opened 2 months ago

wwlwpd commented 2 months ago

@jasonfleming - When debugging a failed directory, I always have to move the directory back to its original name and location first, because the submit scripts and output all refer to the directory's original name.

I propose we create a symlink (same naming convention), instead of moving the entire directory. That way we can just do the manual troubleshooting immediately.

Thoughts?
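For illustration, the proposal above could look roughly like the following. This is only a sketch in Python rather than ASGS code: the function name `link_failed` and the failed-directory name used in the usage note are hypothetical, and the real naming convention is whatever the ASGS already uses.

```python
from pathlib import Path


def link_failed(scenario_dir: Path, failed_name: str) -> Path:
    """Leave scenario_dir in place and create a sibling symlink to it.

    failed_name stands in for whatever name the failed directory would
    normally be given (hypothetical here).
    """
    link = scenario_dir.parent / failed_name
    # A relative target keeps the link valid if the parent directory moves.
    link.symlink_to(scenario_dir.name, target_is_directory=True)
    return link
```

Calling `link_failed(Path("/path/to/run/forecast"), "forecast.failed")` would leave `forecast` untouched, so submit scripts that reference the original path keep working, while `forecast.failed` still marks the failure.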

jasonfleming commented 2 months ago

I think a symlink would work for scenarios that the ASGS will not retry (like forecast scenarios). For nowcast scenarios, where the ASGS keeps retrying in a directory with the same path each time, the ASGS has to move the failed scenario directory out of the way each time and give it a new name. For those scenarios, I would be in favor of a command built into asgsh that an Operator could run on a failed directory to see what its original path was, and possibly to move it back to that original path, giving the incomplete or belatedly successful directory a temporary name.
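A rough sketch of how such a command might behave, written in Python for illustration only. It assumes the original path is recorded in a marker file at the moment the directory is moved aside; the marker name `.original_path`, the `.retry` suffix, and all function names are hypothetical, not existing asgsh features.

```python
import shutil
from pathlib import Path

MARKER = ".original_path"  # hypothetical marker file name


def move_aside(scenario_dir: Path, failed_dir: Path) -> Path:
    """Move a failed scenario out of the way and record where it came from."""
    original = scenario_dir.resolve()
    shutil.move(str(scenario_dir), str(failed_dir))
    (failed_dir / MARKER).write_text(str(original) + "\n")
    return failed_dir


def original_path(failed_dir: Path) -> Path:
    """Report the path the failed directory had before it was moved."""
    return Path((failed_dir / MARKER).read_text().strip())


def restore(failed_dir: Path, temp_suffix: str = ".retry") -> Path:
    """Move failed_dir back to its original path.

    If a retry currently occupies that path, park it under a temporary name.
    """
    target = original_path(failed_dir)
    if target.exists():
        parked = target.with_name(target.name + temp_suffix)
        if parked.exists():
            raise FileExistsError(f"{parked} already exists")
        shutil.move(str(target), str(parked))
    shutil.move(str(failed_dir), str(target))
    return target
```

Here `original_path` covers the "see what its original path was" part, and `restore` covers the optional move back, with the directory that currently holds the original path given the temporary name.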

wwlwpd commented 1 month ago

I like this idea. The main need here is to do troubleshooting, and after a little poking around I tend to want to resubmit the job manually to see it break again for me. Thank you, I'll think more about it as we're actively debugging runs.

wwlwpd commented 1 month ago

The basic idea is to create a simple command that does what I need in this case: either move the failed directory back to its original location, or update the submit scripts with the new "failed" directory name.
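The second option, updating the submit scripts in place, might be sketched as follows. This is Python for illustration only: the function name `retarget_scripts` and the `*.sh` glob are assumptions, since the thread does not say how the submit scripts are named or where the original path appears in them.

```python
import re
from pathlib import Path


def retarget_scripts(failed_dir: Path, original: str, pattern: str = "*.sh") -> list:
    """Rewrite references to the original path so they point at failed_dir.

    Returns the list of scripts that were changed.
    """
    new = str(failed_dir)
    # Match the original path only when it is not followed by more name
    # characters, so "/run/nowcast.failed" is not rewritten a second time.
    regex = re.compile(re.escape(original) + r"(?![\w.-])")
    changed = []
    for script in sorted(failed_dir.glob(pattern)):
        text = script.read_text()
        # A callable replacement avoids backslash interpretation in `new`.
        updated, count = regex.subn(lambda m: new, text)
        if count:
            script.write_text(updated)
            changed.append(script)
    return changed
```

The lookahead makes a second run a no-op even when the failed name merely extends the original name. Moving the directory back avoids any rewriting, so this variant is mainly useful when the original path is already occupied by a retry.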