Open wwlwpd opened 2 months ago
I think a symlink would work for scenarios that the ASGS will not re-try (like forecast scenarios). For nowcast scenarios, where the ASGS keeps re-trying in a directory with the same path each time, the ASGS has to move the failed scenario directory out of the way each time and give it a new name. For those scenarios, I would be in favor of having a command built into asgsh that an Operator could execute on a failed directory to see what its original path was, and possibly to move it back to that original path, giving the incomplete or belatedly successful directory that currently occupies that path a temporary name.
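A rough sketch of what such a command could do, in Python for illustration only. This is not an existing asgsh command: the marker file name `.original_path`, the `.failed.<timestamp>` and `.displaced.<pid>` suffixes, and all function names are assumptions made up for the example. The idea is that the original path gets recorded at the moment the directory is moved aside, so it can be looked up (and restored) later.

```python
import os
import shutil
import time

MARKER = ".original_path"  # hypothetical marker file name, not an ASGS convention


def set_aside_failed(scenario_dir):
    """Move a failed scenario directory aside and record where it came from."""
    original = os.path.abspath(scenario_dir)
    stamp = time.strftime("%Y%m%d%H%M%S")
    failed = f"{original}.failed.{stamp}"  # hypothetical naming scheme
    n = 0
    while os.path.lexists(failed):  # avoid colliding with an earlier failure
        n += 1
        failed = f"{original}.failed.{stamp}.{n}"
    shutil.move(original, failed)
    with open(os.path.join(failed, MARKER), "w") as fh:
        fh.write(original + "\n")
    return failed


def original_path(failed_dir):
    """Report the path a failed directory had before it was moved aside."""
    with open(os.path.join(failed_dir, MARKER)) as fh:
        return fh.read().strip()


def restore(failed_dir):
    """Move a failed directory back to its original path.

    Whatever currently occupies that path (for example a later retry)
    is first moved to a temporary name, which is returned as `displaced`.
    """
    target = original_path(failed_dir)
    displaced = None
    if os.path.lexists(target):
        displaced = f"{target}.displaced.{os.getpid()}"  # hypothetical temp name
        shutil.move(target, displaced)
    shutil.move(failed_dir, target)
    return target, displaced
```

An Operator would call `original_path()` just to inspect a failed directory, or `restore()` to put it back for debugging; the caller is responsible for making sure nothing is actively running in the directory being displaced.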
I like this idea. The main need here is troubleshooting, and after a little poking around I tend to want to resubmit the job manually to see it break again for me. Thank you, I'll think more about it as we're actively debugging runs.
The basic idea is to create a simple command that does what I need in this case: either move the directory back to its original location, or update the submit scripts with the new "failed" directory name.
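For the second option (updating the submit scripts in place), something along these lines might be enough. It is only a sketch in Python: the function name `repoint_scripts` and the list of file suffixes are assumptions, and the real submit script names would have to be checked against what the ASGS actually writes. It does a plain string replacement, so an original path that happens to be a prefix of some unrelated path would also be rewritten.

```python
import os


def repoint_scripts(failed_dir, original_path, suffixes=(".sh", ".slurm", ".pbs")):
    """Rewrite references to original_path so they point at failed_dir.

    Only regular files directly under failed_dir whose names end with one
    of `suffixes` (a hypothetical list) are touched. Returns the names of
    the files that were changed.
    """
    new_path = os.path.abspath(failed_dir)
    changed = []
    for name in sorted(os.listdir(new_path)):
        path = os.path.join(new_path, name)
        if not (os.path.isfile(path) and name.endswith(suffixes)):
            continue
        with open(path) as fh:
            text = fh.read()
        # Normalize first so that running this twice does not append the
        # "failed" suffix a second time when new_path contains original_path.
        updated = text.replace(new_path, original_path).replace(original_path, new_path)
        if updated != text:
            with open(path, "w") as fh:
                fh.write(updated)
            changed.append(name)
    return changed
```

After this, a manual resubmit from inside the failed directory would read and write under the failed directory's own path rather than the original one.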
@jasonfleming - When debugging a failed directory, the submit scripts and output all refer to the directory's original name because of the way they are set up. Every time I have to debug one, I first have to move the directory back to its original name/location.
I propose we create a symlink (using the same naming convention) instead of moving the entire directory. That way we can start the manual troubleshooting immediately.
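To make the proposal concrete, here is a minimal sketch in Python. It assumes the failed directory stays where it is and the symlink carries the "failed" name; the `.failed` suffix and the function name `link_failed` are placeholders, not the actual ASGS naming convention.

```python
import os


def link_failed(scenario_dir, label="failed"):
    """Leave scenario_dir in place and create a sibling symlink marking it failed."""
    original = os.path.abspath(scenario_dir)
    link = f"{original}.{label}"  # hypothetical name for the "failed" entry
    # A relative target keeps the link valid if the parent directory is moved.
    os.symlink(os.path.basename(original), link, target_is_directory=True)
    return link
```

Because the real directory never moves, the paths baked into the submit scripts and output stay valid, and the link only serves as the marker that the scenario failed.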
Thoughts?