Closed shriram-jagan closed 7 months ago
Did this program exit cleanly?
I don't have the session on EOS to check the exit status, but I'd guess that the program exited cleanly because I see the messages that are printed at the end of the simulation, and I didn't see any error messages from running this app. There are NaNs in the simulation but I don't think that would have caused it to crash.
I'm unable to reproduce the crash with Legion Spy from the latest master branch. It has made it through logical verification and is still working on physical verification:
mebauer@c0004:~/legion/tools/shriram$ pypy3 -m pdb ../legion_spy.py -lpa legate_0.log
> /home/mebauer/legion/tools/legion_spy.py(23)<module>()
-> from __future__ import absolute_import
(Pdb) r
Reading log file legate_0.log...
WARNING: Skipped 8 lines when reading legate_0.log
Reducing top-level index space shapes...
|||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%
Done
Computing refinement points...
Dim 1: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%
Dim 2: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%
Dim 3: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%
Dim 4: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%
Done
Computing physical reachable...
|||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0%
Done
Performing logical analysis...
Performing logical dependence verification for legion_python_main...
Pass
Checking for cycles...
No cycles detected
Simplifying event graph...
|||||||||||||||||||||||----------------------------| 44.5%
Note that with a 97 MB Legion Spy log file, the physical analysis is going to take a very long time to do the verification (assuming you don't run out of memory first). You might want to try reducing the number of iterations that you are running.
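As a rough illustration of that trade-off, here is a minimal sketch that picks the Legion Spy flags from the log size: logical plus physical analysis for small logs, logical only for large ones. The helper name `spy_command` and the 50 MB cutoff are assumptions made up for this example; the only data point in this thread is that a 97 MB log is slow to verify physically.

```python
import os

# Hypothetical cutoff, not a Legion Spy setting: chosen only for illustration.
PHYSICAL_LIMIT_BYTES = 50 * 1024 * 1024


def spy_command(log_path, size_bytes=None, limit=PHYSICAL_LIMIT_BYTES):
    """Return an argument list for legion_spy.py based on the log size."""
    if size_bytes is None:
        # Look at the file on disk when the caller did not supply a size.
        size_bytes = os.path.getsize(log_path)
    # -l requests logical analysis, -p adds physical analysis.
    flags = "-lp" if size_bytes <= limit else "-l"
    return ["legion_spy.py", flags, log_path]
```

The returned list could be handed to `subprocess.run`, prefixed with whichever interpreter you use (`pypy3` in the session above).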
Thanks for trying it on your end. I was using the legion_spy.py file that the docker image had.
Yes, I tried a lot of combinations yesterday to reduce the problem size/GPU count, but I couldn't make it fail for smaller problem sizes.
Anyway, let me take it from here.
The Legion Spy in the master branch successfully validated that log file for me.
I'm getting the following error from legion_spy.py when I run legion_spy.py -lp legate_0.log. The log file is here: legate_0.log.gz. I'm trying to make sure Legion Spy passes on this log file. Any workarounds or suggestions would be helpful.
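For anyone picking up the attached log: it is shared gzipped, while the command above is run on the uncompressed legate_0.log. A minimal sketch of unpacking it with the Python standard library follows; the helper name `gunzip` is made up for this example, and whether legion_spy.py can read the .gz directly is not something this thread establishes.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file at src into dst and return dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream in chunks so a large log is never held fully in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

Usage would be `gunzip("legate_0.log.gz", "legate_0.log")`, followed by the legion_spy.py invocation from the report.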