StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Difference between Python and Rust profilers on partial multi-node runs #1319

Closed elliottslaughter closed 7 months ago

elliottslaughter commented 2 years ago

I'm running master at 8a459255909290609611f6f93eb788c9051a3765.

This log file generates different results with Python and Rust: http://sapling.stanford.edu/~eslaught/legion_prof_diff_2022-09-06/prof_64x1_r0_0.gz

Profiler output:

Here's what the diff looks like:

diff -u -r legion_prof_rs/json/utils.json legion_prof_py/json/utils.json
--- legion_prof_rs/json/utils.json      2022-09-06 11:16:41.085886969 -0700
+++ legion_prof_py/json/utils.json      2022-09-06 11:28:28.304097874 -0700
@@ -1 +1 @@
-{"0":["0 (CPU)","0 (Utility)","0 (System Memory)","0 (Channel)"],"1":["1 (Channel)"],"32":["32 (Channel)"],"480":["480 (Channel)"],"all":["all (CPU)","all (Utility)","all (System Memory)","all (Channel)"]}
\ No newline at end of file
+{"0":["0 (CPU)","0 (Utility)","0 (System Memory)","0 (Channel)"]}
\ No newline at end of file
diff -u -r legion_prof_rs/legion_prof_processor.tsv legion_prof_py/legion_prof_processor.tsv
--- legion_prof_rs/legion_prof_processor.tsv    2022-09-06 11:16:43.405749928 -0700
+++ legion_prof_py/legion_prof_processor.tsv    2022-09-06 11:28:28.300098110 -0700
@@ -6,11 +6,11 @@
 CPU Processor 0x1d00000000000004       CPU Proc 4      tsv/Proc_0x1d00000000000004.tsv 1
 Dependent Partition Channel    Dependent Partition Channel     tsv/None.tsv    3
 Fill System Memory 0x1e00000000000000 Channel  [n0] sys        tsv/System_Memory_0x1e00000000000000.tsv        3
-System Memory 0x1e00000000000000 to System Memory 0x1e00000000000000 Channel   [n0] sys to [n0] sys    tsv/(System_Memory_0x1e00000000000000,_System_Memory_0x1e00000000000000).tsv        4
-System Memory 0x1e00000000000000 to System Memory 0x1e00010000000000 Channel   [n0] sys to [n1] sys    tsv/(System_Memory_0x1e00000000000000,_System_Memory_0x1e00010000000000).tsv        3
-System Memory 0x1e00000000000000 to System Memory 0x1e00200000000000 Channel   [n0] sys to [n32] sys   tsv/(System_Memory_0x1e00000000000000,_System_Memory_0x1e00200000000000).tsv        3
-System Memory 0x1e00000000000000 to System Memory 0x1e01e00000000000 Channel   [n0] sys to [n480] sys  tsv/(System_Memory_0x1e00000000000000,_System_Memory_0x1e01e00000000000).tsv        3
-System Memory 0x1e00010000000000 to System Memory 0x1e00000000000000 Channel   [n1] sys to [n0] sys    tsv/(System_Memory_0x1e00010000000000,_System_Memory_0x1e00000000000000).tsv        4
-System Memory 0x1e00200000000000 to System Memory 0x1e00000000000000 Channel   [n32] sys to [n0] sys   tsv/(System_Memory_0x1e00200000000000,_System_Memory_0x1e00000000000000).tsv        4
-System Memory 0x1e01e00000000000 to System Memory 0x1e00000000000000 Channel   [n480] sys to [n0] sys  tsv/(System_Memory_0x1e01e00000000000,_System_Memory_0x1e00000000000000).tsv        3
+System Memory 0x1e00000000000000 to System Memory 0x1e00000000000000 Channel   [n0] sys to [n0] sys    tsv/System_Memory_0x1e00000000000000_System_Memory_0x1e00000000000000.tsv   4
+System Memory 0x1e00000000000000 to System Memory 0x1e00010000000000 Channel   [n0] sys to [n1] sys    tsv/System_Memory_0x1e00000000000000_System_Memory_0x1e00010000000000.tsv   3
+System Memory 0x1e00000000000000 to System Memory 0x1e00200000000000 Channel   [n0] sys to [n32] sys   tsv/System_Memory_0x1e00000000000000_System_Memory_0x1e00200000000000.tsv   3
+System Memory 0x1e00000000000000 to System Memory 0x1e01e00000000000 Channel   [n0] sys to [n480] sys  tsv/System_Memory_0x1e00000000000000_System_Memory_0x1e01e00000000000.tsv   3
+System Memory 0x1e00010000000000 to System Memory 0x1e00000000000000 Channel   [n1] sys to [n0] sys    tsv/System_Memory_0x1e00010000000000_System_Memory_0x1e00000000000000.tsv   4
+System Memory 0x1e00200000000000 to System Memory 0x1e00000000000000 Channel   [n32] sys to [n0] sys   tsv/System_Memory_0x1e00200000000000_System_Memory_0x1e00000000000000.tsv   4
+System Memory 0x1e01e00000000000 to System Memory 0x1e00000000000000 Channel   [n480] sys to [n0] sys  tsv/System_Memory_0x1e01e00000000000_System_Memory_0x1e00000000000000.tsv   3
 System Memory 0x1e00000000000000       [n0] sys        tsv/Mem_0x1e00000000000000.tsv  36
Only in legion_prof_rs/tsv: (System_Memory_0x1e00000000000000,_System_Memory_0x1e00000000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e00000000000000_System_Memory_0x1e00000000000000.tsv
Only in legion_prof_rs/tsv: (System_Memory_0x1e00000000000000,_System_Memory_0x1e00010000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e00000000000000_System_Memory_0x1e00010000000000.tsv
Only in legion_prof_rs/tsv: (System_Memory_0x1e00000000000000,_System_Memory_0x1e00200000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e00000000000000_System_Memory_0x1e00200000000000.tsv
Only in legion_prof_rs/tsv: (System_Memory_0x1e00000000000000,_System_Memory_0x1e01e00000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e00000000000000_System_Memory_0x1e01e00000000000.tsv
Only in legion_prof_rs/tsv: (System_Memory_0x1e00010000000000,_System_Memory_0x1e00000000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e00010000000000_System_Memory_0x1e00000000000000.tsv
Only in legion_prof_rs/tsv: (System_Memory_0x1e00200000000000,_System_Memory_0x1e00000000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e00200000000000_System_Memory_0x1e00000000000000.tsv
Only in legion_prof_rs/tsv: (System_Memory_0x1e01e00000000000,_System_Memory_0x1e00000000000000).tsv
Only in legion_prof_py/tsv: System_Memory_0x1e01e00000000000_System_Memory_0x1e00000000000000.tsv
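Note that every "Only in" line above pairs a Rust file named `(A,_B).tsv` with a Python file named `A_B.tsv`, so those entries reflect a naming difference rather than missing files. As a rough sketch for comparing the two directories, one could map the Rust-style name onto the Python-style name before diffing. The helper name `normalize_channel_filename` is hypothetical and not part of either profiler; the pattern is inferred only from the filenames shown in this diff.

```python
import re

# Pattern inferred from the diff: Rust writes "(SRC,_DST).tsv",
# Python writes "SRC_DST.tsv" for the same channel.
_RUST_CHANNEL_NAME = re.compile(r"\((.+),_(.+)\)\.tsv")


def normalize_channel_filename(name: str) -> str:
    """Rewrite a Rust-style channel filename in the Python style.

    Names that do not match the Rust-style pattern are returned unchanged.
    """
    match = _RUST_CHANNEL_NAME.fullmatch(name)
    if match is None:
        return name
    return f"{match.group(1)}_{match.group(2)}.tsv"


rust_name = "(System_Memory_0x1e00000000000000,_System_Memory_0x1e00010000000000).tsv"
print(normalize_channel_filename(rust_name))
```

With names normalized this way, the remaining differences would be the ones in `utils.json`.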

I feel like Rust is correct here, and Python seems to omit data that ought to be included.

I am not going to put effort into fixing Python. But I mention it because someone else might care.

lightsighter commented 2 years ago

What exactly is the difference for those of us that don't read the tsv syntax?

elliottslaughter commented 2 years ago

Rust includes additional copy channels, specifically those from node 0 to any other node. It appears that Python does not include any copy channels, because it only includes channels for the nodes that were specifically included in the profiler output, and only node 0 was logged. That means that you don't get any inter-node copies in Python at all, even though the node 0 logs do include (at least some of) that data.
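To make the difference in `utils.json` concrete, here is a minimal sketch of the two grouping policies described above. It is not the code of either profiler. It assumes that the node ID sits in bits 40 to 55 of a memory ID, which is inferred from the labels in the diff (`0x1e00010000000000` is `[n1]`, `0x1e00200000000000` is `[n32]`, `0x1e01e00000000000` is `[n480]`). The names `node_of` and `group_channels` are hypothetical.

```python
from collections import defaultdict

# Assumption inferred from the diff: node ID occupies bits 40-55 of a memory ID.
NODE_SHIFT = 40
NODE_MASK = 0xFFFF


def node_of(mem_id: int) -> int:
    """Extract the node ID from a memory ID (under the assumption above)."""
    return (mem_id >> NODE_SHIFT) & NODE_MASK


# (source, destination) memory IDs of the copy channels listed in the diff.
CHANNELS = [
    (0x1E00000000000000, 0x1E00000000000000),
    (0x1E00000000000000, 0x1E00010000000000),
    (0x1E00000000000000, 0x1E00200000000000),
    (0x1E00000000000000, 0x1E01E00000000000),
    (0x1E00010000000000, 0x1E00000000000000),
    (0x1E00200000000000, 0x1E00000000000000),
    (0x1E01E00000000000, 0x1E00000000000000),
]


def group_channels(channels, logged_nodes=None):
    """Group channels under the nodes of their endpoints.

    If logged_nodes is None, a channel is listed under every endpoint node.
    Otherwise it is listed only under endpoint nodes in logged_nodes.
    """
    groups = defaultdict(list)
    for src, dst in channels:
        # A set, so that an intra-node channel is listed once under its node.
        for node in {node_of(src), node_of(dst)}:
            if logged_nodes is None or node in logged_nodes:
                groups[node].append((src, dst))
    return dict(groups)


every_endpoint = group_channels(CHANNELS)                # Rust-like grouping
logged_only = group_channels(CHANNELS, logged_nodes={0})  # Python-like grouping

print(sorted(every_endpoint))  # [0, 1, 32, 480]
print(sorted(logged_only))     # [0]
```

Under these assumptions, the first grouping yields the node keys seen in the Rust `utils.json` (apart from `"all"`), and the second yields the single node key seen in the Python one. In both cases all seven channels are listed under node 0.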

seemamirch commented 2 years ago

The copies are there in the Python profile (if you expand node 0's channel), but I can add the other channels to match the Rust profile (i.e. those copies will appear twice).

elliottslaughter commented 2 years ago

Ok, I missed that in Python.

I guess that in a normal multi-node profile, every copy is listed twice: under the source node, and under the destination node.

So it makes sense to show it under the source node only when we have a subset of nodes.

As long as we're not losing data, I can make Rust match Python's behavior.