Open muraj opened 1 month ago
Mentioning @elliottslaughter @lightsighter @alexaiken for visibility, feel free to add whoever might be interested.
First, I think there are actually three separate questions here:
Having thought about these questions quite a bit now, I'll go ahead and be the first one to put a stake in the ground. I'm going to try to balance several competing concerns in a way that I personally would find reasonable. Others may feel differently and I won't be offended.
With regard to the first question, I think the answer here is probably yes. I think it's time for Realm to step out of Legion's shadow and stand on its own. We do want to attract other users that might just want to use Realm, and they shouldn't need to download all of Legion to be able to do that. We've already done the hardest part by keeping the Realm and Legion code separated. Separating build systems and tests won't be easy, but it's easier than separating code. This may make it more likely for Realm bugs to slip in while Realm is still under-tested, but that should incentivize better Realm testing. This will make our lives a bit harder on the Legion side of the world, but I think I'm ok with that if it gives the Realm team some more autonomy and visibility.
For the second question, it's my preference that the "canonical" Realm repository remain under the Legion github organization. Realm has been developed under the Legion github organization its whole life, and if something happens at NVIDIA I don't want the Legion project to lose control over Realm. In practice, I think for me this means four things:
Other than that, I don't think there's anything else that must happen on the Realm repo in the Legion github organization. Planning, CI, development, etc can all happen somewhere else if so desired.
And perhaps you can now guess where that leaves me on the third question. I think we probably do want to allow the Realm team at NVIDIA to make a fork of the "canonical" repo and put it inside an NVIDIA-owned github organization to make use of NVIDIA's CI resources and have the day-to-day development take place there.
At least for me, I have a couple of preferences for how this works, some of which are more of a deal-breaker than others.
I'll note that the worst-case scenario here is that a bug is introduced into Realm that manifests only in Legion applications and goes several quarters without being found and fixed, so Legion ends up stuck on a 3, 6, 9, or 12 month old version of Realm. If we see signs of that occurring, we will probably need to revisit the conversation about the third question and whether we need to move day-to-day development of Realm back inside the Legion github organization so Legion users can help Realm test itself more robustly until it ramps up its own testing.
I think this approach balances several different competing concerns in a practical way. It gives the Realm team both more autonomy and control over how they manage Realm, while at the same time placing more responsibility on them to rigorously test and maintain Realm since they won't have the crutch of Legion constantly finding and reporting all the things that break in Realm while it is under development. It allows us to make use of the NVIDIA CI resources without the Legion organization needing to give up ownership of Realm.
Thanks @muraj for initiating the discussion and @lightsighter for the feedback. Seems like all the right questions have already been asked here.
> Should we separate Realm into its own repository?
This is a 'yes' for me, and my main objective is to attract new users. A larger user base generally increases the pressure on the runtime's robustness, which I believe will boost Realm's improvement in multiple areas (quality, features, etc.). One idea that's been discussed is to start organizing Realm's source, CMake/Make files, tests, and documents into a separate directory (which isn't the case today). When the time comes, separating it into its own repo would simply involve taking this directory out.
> If we do separate Realm into its own repository, who owns the "canonical" version of the Realm repository?
The answer to this question, in my opinion, should be considered alongside the next one: how will day-to-day development look if Realm's ownership changes? First, I want to understand whose call this is to make. If it's a collective decision, we should take steps to gather feedback from everyone, as I feel that very few people are aware of this initiative at the moment, or perhaps too few truly care. Feedback from the retreat would be good, though perhaps we don't have to wait that long. If we don't have enough people at the Legion meeting for this, we should probably consider inviting them or just reaching out offline with a heads-up asking for feedback.
Personally, I have no objections to the ownership remaining with Legion. However, I do have objections to decisions that could complicate the DevOps side of Realm. The "two-repository hybrid approach" introduces some overhead but certainly provides the necessary middle ground. Taking advantage of NVIDIA CI resources would be another core objective here.
Lastly, I think it would be reasonable to document in detail how day-to-day development workflows will operate under the proposed "hybrid" approach (partially already summarized above), and have everyone sign off on it. If this approach ultimately hinders the Realm team's velocity or leads to a somewhat negative trajectory, I don't see why it couldn't be reconsidered in the future.
> Realm's testing has to become good enough to warrant it being separated from Legion and Legion users can't bear the responsibility of helping to find and fix bugs if the day-to-day development for Realm is happening outside of the Legion github organization. This one is non-negotiable for me.
Yes, I keep repeating the same thing in a loop. This is my biggest concern regarding the whole 'standalone Realm' initiative. We've already defined the testing milestones and made a collective effort, but the required velocity just isn't there yet. Root cause analysis, bug fixes, and the 'occasional' higher-priority feature work oftentimes consume significant bandwidth. The unit testing of the core Realm subsystems, which are runtime-dependent, requires refactoring. The integration tests need an audit, and ideally, we should understand what coverage will be lost by losing the Legion/Regent tests. We should probably apply more pressure to get this done.
This issue is meant to keep track of and discuss different ways to structure Realm as a standalone library and to leverage NVIDIA CI for testing, following a discussion from the Legion meeting on 10/23/2024.
Currently, Realm lives within Legion in a sort of mixed mono-repo style system, which has worked well enough for a while since most of Realm's users are also Legion users. As Realm develops, it is gaining more and more direct users. This leads to the conclusion that Realm really needs to stand on its own as its own library.
Additionally, Legion CI's resources are not enough to capture all the configurations and implementation bugs that have arisen as a result of the large amount of engineering effort on Realm. To manage this, several NVIDIA employees contributing to Realm would like to leverage NVIDIA resources for CI. To do so, though, NVIDIA needs to own the code repository and limit users' access to the NVIDIA CI test machines. There are many ways to manage this:
1) Move active development to an open-source, public github repository under NVIDIA ownership. Contributors are welcome to provide pull requests (PRs), but ultimately an NVIDIA employee needs to approve running CI on the PR and approve merging the PR in.
   1.a) This would be the best-case scenario for the majority of active Realm developers, especially if we can whitelist contributors to trigger CI.
2) Keep the gitlab / github repositories as they are, but move most of the active development to an NVIDIA-owned repository that will periodically make code drops to the gitlab / github mirrors.
   2.a) This is a major pain for devops, as each code drop will require work to reconcile changes from upstream with the code drop.
3) Keep the gitlab / github repositories as they are, but move most of the active development to an NVIDIA-owned repository that will be used solely for NVIDIA CI, and changes must still go through the gitlab PR approval process.
   3.a) This is a major pain for development, in that it requires a large amount of our development to jump through extra hoops, which will significantly slow down the development process.