Closed: lwellerastro closed this issue 2 years ago.
Full job information:
export SACCT_FORMAT="JobID%20,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32"
(base) -bash-4.2$ sacct -j 45306855
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
45306855 lq30jig lweller longall mem1 00:40:44 FAILED 11:0 billing=1,cpu=1,mem=50G,node=1
45306855.batch batch mem1 00:40:44 FAILED 11:0 19785568K cpu=1,mem=50G,node=1
45306855.extern extern mem1 00:40:45 COMPLETED 0:0 0 billing=1,cpu=1,mem=50G,node=1
If this runs in one environment and not another, we are going to need to know the differences between these environments. What OS are both of these machines running? Are the conda list outputs identical on both machines?
If you are asking me @jlaura, I don't know the answers. As I understand it, the job inherits the version of ISIS that is active when it is executed (conda activate isis5.0.2), but maybe I don't fully understand that either. I don't live in any other conda environments. Also, for the most part I am working on a VM IT made available to me last summer, igswzawgvsiplw. I'm not sure who can get on that system; it is a mimic of astrovm5. I also tried the above via astrovm5 and got similar results.
Is there something I can look at, or do we need to involve IT? I chose not to send this problem to them initially because I figured they'd shrug and suggest posting here.
I was able to confirm yesterday that it's running the same conda environment, isis5.0.2 in the shared install. So the differences will mainly be in the system libraries that are dynamically linked in. I tried a little last night to check what was getting pulled in on the cluster, but couldn't get an interactive session in my short time trying because the cluster was slammed.
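A minimal sketch of how the dynamically linked libraries could be compared between the VM and a cluster node (assuming the shared isis5.0.2 environment puts jigsaw on PATH after activation; the output file names are just illustrative):

# On the VM (igswzawgvsiplw), after activating the shared environment:
conda activate isis5.0.2
ldd "$(which jigsaw)" | sort > vm_jigsaw_libs.txt

# Repeat on a cluster node (e.g. inside a short interactive job on mem1), then diff:
ldd "$(which jigsaw)" | sort > mem1_jigsaw_libs.txt
diff vm_jigsaw_libs.txt mem1_jigsaw_libs.txt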
Comparing how this is erroring against #3871, I'm not sure if they are related, as this one is segfaulting and the LROC network is hitting a CHOLMOD error. It could be the same underlying issue, but presenting differently? We'll need to do more investigating before we can say if or how they are related.
We'll need to determine where the seg fault is happening, which may require a custom build. The first step is probably to run this again in a dev build.
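One rough way the crash site could be located even without a custom build, assuming core dumps can be enabled on the node and gdb is available there (the core file name depends on the system's core_pattern setting):

# Let the core file actually be written, then rerun the failing command:
ulimit -c unlimited
jigsaw froml=... cnet=... onet=... maxits=1 ...   # same arguments as the failing run

# Load the resulting core into gdb and print the backtrace:
gdb "$(which jigsaw)" core.<pid>
(gdb) bt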
OK, I ran conda list for my VM igswzawgvsiplw, astrovm5, and mem1 in two different ways.
The files are under /work/users/lweller/Isis3Tests/Jigsaw/NebSegFault/.
diff igswzawgvsiplw_condalist.txt mem1_inherit_avlw_condalist.txt
These are identical. As suspected, when I send a job from my VM to mem1 (or any node), it picks up the submitting host's conda environment.
diff igswzawgvsiplw_condalist.txt astrovm5_condalist.txt
The conda lists are slightly different between my VM and astrovm5. I'll have to confirm whether or not I tried to send jobs from astrovm5.
diff astrovm5_condalist.txt mem1_condalist.txt
These are identical. The second listing is via ssh mem1, then running conda list there.
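For what it's worth, the inheritance could also be confirmed from inside a batch job itself, rather than by ssh'ing to the node; a sketch (the output file name is made up):

# Submit a trivial job that records the node it ran on and what conda sees there:
sbatch --partition=longall --nodelist=mem1 --time=00:05:00 \
  --output=mem1_job_env.txt \
  --wrap "hostname; which jigsaw; conda list"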
And to be clear, I am sending my jigsaw jobs (Kaguya TC and Titan) using a single-line send-off via sbatch. I have my Kaguya job running again via astrovm5 to confirm it doesn't work there (I can't tell the submit host from the job info, and there is no print file with this info either), and this is how I run it: (Update: this job failed from astrovm5 as well.)
sbatch --mem=50G --partition=longall --nodelist=mem1 --time=15:00:00 --job-name=av5lq30jig --output=LOG_Jig_AVM5 --wrap "jigsaw froml=KTC_Morning_LQ30_FFCombine_Del.lis cnet=KTC_Morning_LQ30_FFCombine_Thin_SubReg.net onet=JigOut_KTC_Morning_LQ30_FFCombine_Thin_SubReg.net radius=yes update=no sigma0=1.0e-5 maxits=1 camsolve=accelerations twist=yes overexisting=yes spsolve=position overhermite=yes camera_angles_sigma=0.25 camera_angular_velocity_sigma=0.1 camera_angular_acceleration_sigma=0.01 spacecraft_position_sigma=1000 point_radius_sigma=100 file_prefix=RadAccelTwist_SpkPos"
Submitted batch job 45314154
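A small tweak that could make the submit host visible in the job output: Slurm sets SLURM_SUBMIT_HOST inside the job, so echoing it at the start of the wrapped command records it in the LOG file (a sketch; the jigsaw arguments are elided here):

sbatch --mem=50G --partition=longall --nodelist=mem1 --time=15:00:00 \
  --job-name=av5lq30jig --output=LOG_Jig_AVM5 \
  --wrap 'echo "submitted from: $SLURM_SUBMIT_HOST, running on: $(hostname)"; jigsaw froml=... cnet=... onet=... ...'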
If it's of any interest, my Titan network gets to the point of failure much quicker than Kaguya (in about 17 minutes), and there is a tad more in that log file, but still no error message, just a reported core dump which doesn't actually exist. I'm tossing the following input/output into /work/users/lweller/Isis3Tests/Jigsaw/NebSegFault/Titan:
sbatch --partition=longall --nodelist=neb3 --mem=20G --time=15:00:00 --job-name=avlwcissjig --output=LOG_Jig_AVMLW --wrap "jigsaw froml=CISS_Titan_FFCombined7_Thin_PixReg_SubReg.lis cnet=CISS_Titan_FFCombined7_Thin_PixReg_SubReg_Ground.net onet=JigOut_CISS_Titan_FFCombined7_Thin_PixReg_SubReg_Ground.net radius=yes update=no errorprop=no sigma0=1.0e-10 maxits=10 camsolve=angles twist=yes spsolve=position camera_angles_sigma=1 spacecraft_position_sigma=2000 point_radius_sigma=2000 point_latitude_sigma=8000 point_longitude_sigma=8000 file_prefix=RadAngTwist_SpkPos_Ground"
Submitted batch job 45314153
I sent this to neb3 originally by choice because the cluster was getting hammered at the time, so it's not just mem1. The above was submitted from igswzawgvsiplw, but I will run another from astrovm5 as well.
@lwellerastro I came back to run some tests on the Kaguya network and found that the input network is missing. I can't find KTC_Morning_LQ30_FFCombine_Thin_SubReg.net in /work/users/lweller/Isis3Tests/Jigsaw/NebSegFault/
@jessemapel - oops, copied the wrong network into that directory. The correct one is there now. Sorry!
Good news: Craig was able to get me access to our debugger on the cluster over the weekend and I got this sorted out. There was an issue similar to #4545 where we were using an ALE call that wasn't working correctly. Fortunately, this got cleaned up a little while ago and was fixed in a later release of ALE.
So, I was actually able to run both bundles with ISIS 6.0, 7.0_RC1, and dev.
@lwellerastro Can you try and run these again with 6.0 or 7.0_rc2?
Great news @jessemapel! I'll turn on my bundles shortly and let you know how it goes.
@jessemapel, I'm still getting a segfault while running under isis6.0.0 for the Kaguya network.
I'm running this from a system IT built me named igswzawgvsiplw, which is essentially a clone of astrovm5. I've been working on it for about a year now with no problems other than the one in this particular post (I can run the above jigsaw command on this VM and it works fine, it just eats up memory). I'm duplicating what was in the original post under a different directory, and the job went to mem1, if that matters.
I also sent the same job to the cluster running under isis7.0.0-RC2 and it has not crashed yet, but my LOG file is not showing progress.
I sent the Titan network to the cluster as well running under isis7.0.0-RC2 and it also has not crashed yet, but there haven't been any updates to the LOG file.
I think it's sort of odd that I'm not seeing any new standard output in my log files for the isis7.0.0-RC2 runs, especially for Titan, which is much faster than Kaguya (they have been running for 1.5 hours). I'm going to execute an interactive run on the cluster to see what's going on.
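For reference, an interactive run on the cluster could look roughly like this (a sketch; partition, memory, and time values are placeholders):

# Ask Slurm for an interactive shell on a compute node:
srun --partition=longall --mem=50G --time=04:00:00 --pty bash

# Then, inside the allocation:
conda activate isis5.0.2
jigsaw froml=... cnet=... onet=... maxits=1 ...   # same arguments as the batch runs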
Update: Kaguya ran two iterations (maxits=2) successfully via isis7.0.0-RC2. Titan is still running, but my command had maxits=10, so who knows how long it will take. It's odd that the iteration info is not going to my stdout LOG until the end.
To recap: it looks like the problem is fixed via isis7.0.0-RC2, but it continues to fail under isis6.0.0. At the moment, I have no problem running under isis7.0.0-RC2, or under whatever comes after it once there is a public release.
Stuff not hitting the log until the end is a known thing. Since we updated the jigsaw app to be callable, we had to redo how it logs, and part of that is that it logs everything at the end.
We really need to get a real-time logger set up.
I'm going to close this for now as it seems like it's been resolved adequately. If we run into issues once the full 7.0 release is out we can come back.
Sounds good @jessemapel. I didn't know about the output not writing to the log until the end; thanks for the heads up. Titan finished as well under isis7.0.0-RC2.
ISIS version(s) affected: 5.0.2
Description
Jigsaw will segfault on Astro's cluster when solving for camera accelerations and spacecraft position for my very large Kaguya TC south pole network. There are no error messages associated with the failure on the cluster, and although it says a core was dumped, there is none. This happens whether I send a job, run an interactive job, or log on to a node directly and try to run it, and also when I ask for an entire node and all of its available memory, which is far more than the job needs.
It runs successfully on the astrovms. Specifically, it runs on an astrovm built for me recently (igswzawgvsiplw), but it was built to match astrovm5.
This is not a memory issue; I watched the process run on my VM and it used ~20G of memory, and the sacct output posted in this thread for my failed job 45306855 shows similar peak usage, well under the 50G allocation.
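For context, the peak memory the failed job actually used can be pulled from Slurm's accounting, which is where the MaxRSS column in the table above comes from (19785568K, roughly 19G, for the batch step against a 50G allocation):

sacct -j 45306855 --format=JobID%20,State,ExitCode,MaxRSS,ReqMem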
It looks like it fails right when it starts the first iteration.
I can send smaller quads to the cluster and solve for camera accelerations and spacecraft position without error, so this seems to be a size thing. I can also send the Kaguya TC polar network to the cluster and just solve for camera accelerations and that works as well (I think that uses about 17G of memory). The same is true for Titan - when I add spsolve=position it will not run on the cluster.
So the problem is specifically with large networks and solving for spacecraft position which is 100% necessary for both projects listed. This is likely related to https://github.com/USGS-Astrogeology/ISIS3/issues/3871
My Kaguya TC south pole network has 16238 images, 548912 points, and 3899619 measures. The points and measures exceed the LROC NAC south pole network.
How to reproduce
See data in /work/users/lweller/Isis3Tests/Jigsaw/NebSegFault/. See proc.txt for the command below.
See the very thin, not very informative stdout file SegFault_LOG_Jig from the cluster run when/where it fails. Note: I added the output files and print.prt from the successful astrovm run for reference.
This takes a very long time to run, so I'd set maxits=1. When I first sent the job to the cluster I had it set to 5 iterations as below. I believe it took about 40+ minutes after sending to the cluster to start an iteration and die.
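A quicker check could run just a single iteration; a sketch based on the jigsaw command posted earlier in this thread (the onet and file_prefix values here are made-up names, everything else is as posted):

cd /work/users/lweller/Isis3Tests/Jigsaw/NebSegFault/
jigsaw froml=KTC_Morning_LQ30_FFCombine_Del.lis \
  cnet=KTC_Morning_LQ30_FFCombine_Thin_SubReg.net \
  onet=JigOut_OneIter_Test.net \
  radius=yes update=no sigma0=1.0e-5 maxits=1 \
  camsolve=accelerations twist=yes overexisting=yes spsolve=position overhermite=yes \
  camera_angles_sigma=0.25 camera_angular_velocity_sigma=0.1 \
  camera_angular_acceleration_sigma=0.01 spacecraft_position_sigma=1000 \
  point_radius_sigma=100 file_prefix=OneIter_Test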
Possible Solution
It seems like nebula and its nodes don't know something, or don't know how to get access to something.
Additional context
Although I can run things on my astrovm, I have limited memory there to run jigsaw, qnet, and any other ISIS program at the same time. Running Titan and Kaguya at the same time (which may happen since both take so long to run) limits what else I can do on the system. Additionally, there is only so much memory available on astrovm4 and astrovm5 for other project members to run their polar work when they get to it, in addition to the resources that are being used by the LROC NAC polar project.
We really need to be able to bundle these enormous networks on the cluster.
I was going to ask Janet to run her failing jigsaw on astrovm4 (it has more memory than astrovm5) to see if it is successful there.