StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion Spy: Assertion error from legion spy #1684

Closed shriram-jagan closed 5 months ago

shriram-jagan commented 5 months ago

I'm getting the following assertion error when I run legion_spy.py -lp legate_0.log. The log file is here: legate_0.log.gz. I'm trying to get Legion Spy to pass on this log file. Any workarounds or suggestions would be helpful.

Performing logical analysis...
Performing logical dependence verification for legion_python_main...
Traceback (most recent call last):
  File "/Users/sjagannathan/work/cllr/apps/pyminiweather/nans/eos/./legion_spy.py", line 14650, in <module>
    main(temp_dir)
  File "/Users/sjagannathan/work/cllr/apps/pyminiweather/nans/eos/./legion_spy.py", line 14583, in main
    state.perform_logical_analysis(logical_checks)
  File "/Users/sjagannathan/work/cllr/apps/pyminiweather/nans/eos/./legion_spy.py", line 13681, in perform_logical_analysis
    task.perform_task_logical_verification()
  File "/Users/sjagannathan/work/cllr/apps/pyminiweather/nans/eos/./legion_spy.py", line 9092, in perform_task_logical_verification
    if not op.perform_op_logical_verification(op, previous_deps):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sjagannathan/work/cllr/apps/pyminiweather/nans/eos/./legion_spy.py", line 7368, in perform_op_logical_verification
    assert logical_op.context is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

lightsighter commented 5 months ago

Did this program exit cleanly?

shriram-jagan commented 5 months ago

I don't have the session on EOS to check the exit status, but I'd guess that the program exited cleanly because I see the messages that are printed at the end of the simulation, and I didn't see any error messages from running this app. There are NaNs in the simulation but I don't think that would have caused it to crash.
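When the session is gone, the exit status is gone with it. A minimal sketch of a wrapper that records the status next to the run, so the question "did it exit cleanly?" can be answered later. The function name `run_and_record` and the status file name are hypothetical, and the application command is whatever you normally launch:

```python
import subprocess
from pathlib import Path


def run_and_record(cmd, status_file):
    """Run cmd, write its exit status to status_file, and return the status."""
    # subprocess.run waits for the child process and exposes its return code.
    result = subprocess.run(cmd)
    Path(status_file).write_text(f"{result.returncode}\n")
    return result.returncode


# Hypothetical usage; replace the list with the real launch command:
# status = run_and_record(["./my_app", "--some-flag"], "exit_status.txt")
```

A status of 0 conventionally means a clean exit. On its own it does not prove the Legion Spy log is complete, only that the process did not report a failure.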

lightsighter commented 5 months ago

I'm unable to reproduce the crash with the Legion Spy from the latest master branch. It has made it through logical verification and is still working on physical verification:

mebauer@c0004:~/legion/tools/shriram$ pypy3 -m pdb ../legion_spy.py -lpa legate_0.log 
> /home/mebauer/legion/tools/legion_spy.py(23)<module>()
-> from __future__ import absolute_import
(Pdb) r
Reading log file legate_0.log...
WARNING: Skipped 8 lines when reading legate_0.log
Reducing top-level index space shapes...
 |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0% 
Done
Computing refinement points...
Dim 1: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0% 
Dim 2: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0% 
Dim 3: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0% 
Dim 4: |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0% 
Done
Computing physical reachable...
 |||||||||||||||||||||||||||||||||||||||||||||||||||| 100.0% 
Done
Performing logical analysis...
Performing logical dependence verification for legion_python_main...
Pass
Checking for cycles...
No cycles detected
Simplifying event graph...
 |||||||||||||||||||||||----------------------------| 44.5%

lightsighter commented 5 months ago

Note that with a 97 MB Legion Spy log file, the physical analysis is going to take a very long time to finish verification (assuming it doesn't run out of memory first). You might want to reduce the number of iterations that you are running.

shriram-jagan commented 5 months ago

Thanks for trying it on your end. I was using the legion_spy.py file that shipped with the Docker image.

Yes, I tried a lot of combinations yesterday to reduce the problem size and GPU count, but I couldn't make it fail for smaller problem sizes.

Anyway, let me take it from here.

lightsighter commented 5 months ago

The Legion Spy in the master branch successfully validated that log file for me.
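The difference in this thread was between the legion_spy.py bundled with a Docker image and the one in the master branch. A minimal sketch for checking whether two copies of the script are identical before debugging further. The helper names and the example paths are hypothetical:

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_script(local_copy, reference_copy):
    """True when both copies of the script are byte-for-byte identical."""
    return file_sha256(local_copy) == file_sha256(reference_copy)


# Hypothetical usage; both paths are placeholders:
# same_script("./legion_spy.py", "/path/to/legion/tools/legion_spy.py")
```

If the digests differ, it is worth rerunning with the copy from the same Legion checkout that produced the log before treating the assertion as a real bug.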