StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
675 stars 145 forks source link

Same legion_prof versions are different #1757

Open suranap opened 1 week ago

suranap commented 1 week ago

Suggestion: Embed a version number in the profile file, preferably in text at the top. Tag legion_prof versions in git. Now I can head -1 *.prof to know which version of legion_prof I need. Then I can find the right legion_prof_#### tag and build it. And you can add an error message that says, "you need version #### of legion_prof" when it doesn't work.

I'm using cunumeric v24.06.01 from conda. They don't say what version of Legion they used. So there's no way to know which legion_prof to use without asking them.

My previous and new version of legion_prof have the same version number, but are different:

$ legion_prof -V
legion_prof 0.2406.0
$ legion_prof archive legate_0.prof
Reading log file "legate_0.prof"...
Matched 31454 objects
No Legion Spy data, skipping postprocess step
Sorting time ranges
Created output directory "legion_prof"
Writing level 0 with 1 tiles
Writing level 1 with 4 tiles
Writing level 2 with 16 tiles
Writing level 3 with 64 tiles
$ ~/legion_prof -V
legion_prof 0.2406.0
$ ~/legion_prof --archive legate_0.prof
Reading log file "legate_0.prof"...
thread 'main' panicked at src/serialize.rs:1133:5:
assertion `left == right` failed: Legion Prof was built against an incompatible Legion version. Please rebuild with the same version of Legion used by the application to generate the profile logs. (Expected version 1004, got version 1005.)
  left: 1005
 right: 1004
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
elliottslaughter commented 1 week ago

Legion knows what version it is:

https://legion.stanford.edu/doxygen/class_legion_1_1_runtime.html#ad454cb6f6219e29c1278480a3dd4d09a

It would be easy enough to embed the version in the profile, but you can already query it directly via Legion::Runtime::get_legion_version(). If cuNumeric doesn't provide a way to get this, maybe they should.

elliottslaughter commented 1 week ago

My previous and new version of legion_prof have the same version number, but are different

As for this issue, this means that cuNumeric is building Legion off of a non-stable release. Version numbers are only accurate at stable releases. If they built against a stable release (and better, through crates.io), this would not happen.

suranap commented 1 week ago

Neither cuNumeric nor FlexFlow expose get_legion_version. Legion should be an implementation detail users don't need to know.

For versions, there's (1) file format, (2) legion_prof, and (3) legion.stanford.edu/prof_viewer. Let's say I build using the same Legion commit. There might be bug fixes in later versions of legion_prof that still work on the same file format. I'd have to find the latest legion_prof commit that works on that file format.

elliottslaughter commented 1 week ago

Legion should be an implementation detail users don't need to know.

Exactly. They should provide the profiler so users don't have to guess. Everything else being discussed in this issue is a bandaid over the fact that cuNumeric and FlexFlow are failing to provide users with what they need.

There might be bug fixes in later versions of legion_prof that still work on the same file format.

The version is tracked in this file: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling_version.h

You can see the history of the file here: https://gitlab.com/StanfordLegion/legion/-/commits/master/runtime/legion/legion_profiling_version.h

The version in question is in your original comment: 1005 (in https://github.com/StanfordLegion/legion/issues/1757#issue-2522971309)

Therefore, based on scanning the history, this version got added in ea263e2adb9bd06c92ba30d34efb978a4dc0f3d3 and ought to work through just before the next merge that updates it.

lightsighter commented 5 days ago

Therefore, based on scanning the history, this version got added in https://github.com/StanfordLegion/legion/commit/ea263e2adb9bd06c92ba30d34efb978a4dc0f3d3 and ought to work through just before the next merge that updates it.

I think this is not quite accurate given how we update the version file. Often I'll bump the version number in my first commit to a feature branch that changes the profiling format, but then I make many later changes to get the feature branch to work correctly. We don't squash commits when merging feature branches (due to the mess it makes in the history), so it might only be several commits later where things will actually work. I think we'll need to be more disciplined about how we update the version file in order for this to work and only change it in the very last commit of a new feature branch before we merge it.

As for this issue, this means that cuNumeric is building Legion off of a non-stable release. Version numbers are only accurate at stable releases.

I don't think it will be reasonable for the Legate team to only build off of stable releases. Things are moving a bit too fast for that.

elliottslaughter commented 5 days ago

I think we'll need to be more disciplined about how we update the version file in order for this to work and only change it in the very last commit of a new feature branch before we merge it.

No need. The following command gives you a complete and accurate history of the versions by tracking only the merge commits, so that anything that happened in the development branch prior to merging is irrelevant:

$ git log -m -p --oneline --follow --first-parent -- runtime/legion/legion_profiling_version.h
8479cf19a Merge branch 'criticalpath' into 'master'
diff --git a/runtime/legion/legion_profiling_version.h b/runtime/legion/legion_profiling_version.h
index 9540e56f9..fb35a14c0 100644
--- a/runtime/legion/legion_profiling_version.h
+++ b/runtime/legion/legion_profiling_version.h
@@ -1 +1 @@
-1006
+1007
067e5f278 Merge branch 'backtracewait' into 'master'
diff --git a/runtime/legion/legion_profiling_version.h b/runtime/legion/legion_profiling_version.h
index 49bc2728c..9540e56f9 100644
--- a/runtime/legion/legion_profiling_version.h
+++ b/runtime/legion/legion_profiling_version.h
@@ -1 +1 @@
-1005
+1006
e032725a3 Merge branch 'fixlonglatency' into 'master'
diff --git a/runtime/legion/legion_profiling_version.h b/runtime/legion/legion_profiling_version.h
index 59c112266..49bc2728c 100644
--- a/runtime/legion/legion_profiling_version.h
+++ b/runtime/legion/legion_profiling_version.h
@@ -1 +1 @@
-1004
+1005
b38b841a4 Merge branch 'userprof' into 'master'
diff --git a/runtime/legion/legion_profiling_version.h b/runtime/legion/legion_profiling_version.h
index baccd0398..59c112266 100644
--- a/runtime/legion/legion_profiling_version.h
+++ b/runtime/legion/legion_profiling_version.h
@@ -1 +1 @@
-1003
+1004
...

Remember, it's a graph. So you have everything you need to figure out the answer, it's just a matter of finding the right graph analysis to do it.

lightsighter commented 4 days ago

That may be strictly true, but I don't think we want users to have to do that analysis just to figure out which version of Legion they need to be using to get the profiler to work.

elliottslaughter commented 4 days ago

I'm going to repeat what I said earlier in the thread:

Legion should be an implementation detail users don't need to know.

Exactly. They should provide the profiler so users don't have to guess. Everything else being discussed in this issue is a bandaid over the fact that cuNumeric and FlexFlow are failing to provide users with what they need.

When you release a product based on an unreleased version of Legion, you bear the responsibility for providing all the relevant matching tools, which in this case includes the profiler.

If you still want the bandaid, we can embed the Legion version string into the Legion Prof log format. Legion already has it, so that's trivial. And the parsing is trivial. That would allow me to generate an error message like:

This profiler is not compatible with the version of the logs provided. Please install Legion Prof from Legion version <VERSION> to parse these logs.

Where VERSION will be something like legion-24.06.0-553-gc032dab25 and thus will include the specific Git commit ID.