JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Monorepo Packages via Submodules/Subpackages: Support in Pkg and Julia Compiler #55516

Open ChrisRackauckas opened 4 weeks ago

ChrisRackauckas commented 4 weeks ago

There are many things to be said about how to grow and scale repositories. However, I think it's clear that OrdinaryDiffEq.jl is a repo that is currently hitting the scaling limits of Julia and thus a good case to ground this discussion. It's a repository which is very performance-minded, has many different solvers (hundreds), and thus has many optimizations around. However, in order to be usable, it needs to load "not so badly". With the changes to v1.9 adding package binaries, we could get precompilation to finally build binaries of solvers and allow for first run times under a second. We need to continue to improve this, but job well done there.

However, that has started to show some cracks around it. In particular, the precompilation times to build the binaries are tremendous. At first they were measured close to an hour, but by tweaking with preferences we got it to around 5-10 minutes. The issue is that, again, there are many solvers, so if you want to make all solvers load fast, you need many binaries. Julia's package binary system is on the package level, so you only have a choice of whether to build the solvers of the package or not. We then set up Preferences.jl and by default tuned it down a bunch, but what that really meant is that while there is a solution to first solve problems, it's simply not usable in the vast majority of the cases by most users.

Also, one of the issues highlighted by @KristofferC was that some of the solvers were still causing unsolvable loading time issues simply by existing in the repo. As highlighted in https://github.com/SciML/OrdinaryDiffEq.jl/issues/2177, for example the implicit extrapolation methods, which are a very rare method to use, caused 1.5 seconds of lowering time. This is something that we had discussed as potentially decreasing when the new lowering would come into play, but not to zero and likely still relatively non-zero, and so it was determined that post v1.10 we were likely to see no more improvements due to Julia Base simply because the repo is too large. Too much stuff = already saturated in load time improvements.

However, since the unit of package binaries is the package, the next solution is to simply make a ton of packages. Thus OrdinaryDiffEq.jl recently did a splitting process that changed it from 1 package into 30 packages. That's somewhat intermediate; I assume we'll get to about 50 packages over the next year as we refine it down to the specific binaries people want. There are a few interdependencies in there as well: everything relies on OrdinaryDiffEqCore, and implicit methods have a chain like OrdinaryDiffEqCore -> OrdinaryDiffEqDifferentiation -> OrdinaryDiffEqNonlinearSolve -> OrdinaryDiffEqBDF. These are all just libs in the same repo, so it's a monorepo of tens of packages. This effectively parallelizes package binary construction, loads only what the user requests (thus fixing the "too much stuff" problem), allows all solvers to be set to precompile, and gives users an easy way to pick the subset of binaries they want (by picking the solver packages).

It also has a nice side benefit that the dependencies of most solvers are decreased. Since for example only the exponential integrators need ExponentialUtilities.jl, that's a dependency of the solver package but not the core now, meaning most people get fewer dependencies.

To an extent, this is simply using monorepos and tons of packages to solve the problem: since our one hammer is package binaries, we just make everything a package. However, there are several downsides to this approach:

  1. For one, the registration process is a bit of a nightmare. Any sweeping change requires that we register 30 packages, which is something that must be done in JuliaRegistrator comments, each must be a different post. I.e. you cannot do a multiline registration, and thus it cannot be done with a simple copy paste, you need to manually copy and paste 30 different lines and if you miss one it will take a long while to figure out that you forgot to release one out of the 30 packages. Even just getting the packages registered we're forgetting which ones are already released and which ones are not.
  2. Inter-Dependency management is a bit of a mess. This kind of monorepo setup naturally has deep dependencies between the solvers and the core repo, since they are not made to be working through public API but internal API. You can either try to be very diligent with minimum bound bumping (which cannot be tested with the current Pkg resolver, already discussed with @StefanKarpinski), or you can lock all solvers to versions of Core. The latter sounds like even more dependency hell so we're trying the first for now. But in theory the lock-step of such a monorepo could be enforced automatically: doing it by hand is very error-prone though and makes it so you have to bump all 30 solvers every time you make a change to OrdinaryDiffEqCore, making (1) a really big problem again.
  3. The testing of package-wide interfaces is a bit wonky. For all CI we simply always download all 30 packages because doing anything else isn't automatable with our current tools.
  4. Contributing to the package is hard because in order to even install it you need to install all 30 packages, which is not normal or obvious.
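
As a rough illustration of the manual step in (1), the comment lines themselves could at least be generated rather than typed. This is a minimal sketch, not existing tooling: the function name `registrator_comments` is hypothetical, and it assumes the sub-packages live in immediate subdirectories of a `lib/` folder, each with its own Project.toml. Each returned line would still have to be posted as its own comment.

```julia
using TOML  # standard library

# Build one Registrator comment line per package found under `libdir`.
# Assumption: every immediate subdirectory that holds a Project.toml is a package.
function registrator_comments(repo_root::AbstractString; libdir::AbstractString = "lib")
    comments = String[]
    root = joinpath(repo_root, libdir)
    for name in sort(readdir(root))
        project = joinpath(root, name, "Project.toml")
        isfile(project) || continue
        # Parse the file so a malformed Project.toml fails here, before registration.
        TOML.parsefile(project)
        push!(comments, "@JuliaRegistrator register subdir=$(libdir)/$(name)")
    end
    return comments
end
```

Calling `foreach(println, registrator_comments("."))` from the repository root would print the lines in sorted order, which at least makes a missed package easier to spot.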

As a result, it's not a great experience, but it's the best that we have to scale today.

What's the real issue and solution?

The real issue here is that we have privileged packages in a way that we have not privileged modules. In a sense, OrdinaryDiffEqCore is a submodule, OrdinaryDiffEqTsit5 is a submodule. They are all submodules of the same package. However, since we only have dependencies, binary building, etc. as package level features, we put all of these submodules into different packages. Then:

using OrdinaryDiffEqCore, OrdinaryDiffEqTsit5

is "the solution", and we have 50 packages roaming around that are actually all submodules of 1 package. In reality, what would be really nice is to use the submodule system for this. For example:

using OrdinaryDiffEq: OrdinaryDiffEqCore, OrdinaryDiffEqTsit5

If these were submodules of OrdinaryDiffEq, we could just have one versioned version of the package and all of our issues would go away... if the following features were supported:

  1. Dependencies defined on a per-submodule basis. So you could for example have a dependency that is only added in some sense if the user requests access to the OrdinaryDiffEq.OrdinaryDiffEqTsit5 submodule. If done correctly, it could also handle some option dependency issues like https://github.com/SciML/LinearSolve.jl/issues/524.
  2. Separate package binaries per submodule. Precompilation could in theory be done per submodule of a package, not simply based on the main package module. And it could parallelize that precompilation process based on the dependencies between modules if that was made explicit instead of implicit.
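
To make the parallelization idea in (2) concrete, here is a minimal sketch of how explicit submodule dependencies could be turned into precompilation batches. Nothing here is Pkg API: `precompile_batches` is a hypothetical helper, and the dependency table reuses the chain mentioned earlier plus an assumed OrdinaryDiffEqTsit5 that depends only on the core.

```julia
# Group modules into batches: every module in a batch depends only on modules
# from earlier batches, so the members of one batch could be precompiled in parallel.
function precompile_batches(deps::Dict{String,Vector{String}})
    remaining = Dict(name => Set(d) for (name, d) in deps)
    batches = Vector{Vector{String}}()
    while !isempty(remaining)
        ready = sort([name for (name, d) in remaining if isempty(d)])
        if isempty(ready)
            # Also triggered when a dependency is not listed as a key of `deps`.
            stuck = join(sort(collect(keys(remaining))), ", ")
            error("cyclic or unknown dependencies among: " * stuck)
        end
        push!(batches, ready)
        for name in ready
            delete!(remaining, name)
        end
        for d in values(remaining)
            setdiff!(d, ready)
        end
    end
    return batches
end

# Hypothetical dependency table (module => modules it depends on).
deps = Dict(
    "OrdinaryDiffEqCore" => String[],
    "OrdinaryDiffEqDifferentiation" => ["OrdinaryDiffEqCore"],
    "OrdinaryDiffEqNonlinearSolve" => ["OrdinaryDiffEqDifferentiation"],
    "OrdinaryDiffEqBDF" => ["OrdinaryDiffEqNonlinearSolve"],
    "OrdinaryDiffEqTsit5" => ["OrdinaryDiffEqCore"],
)
batches = precompile_batches(deps)
```

With this table the core forms the first batch, OrdinaryDiffEqDifferentiation and OrdinaryDiffEqTsit5 share the second, and the rest of the implicit chain follows one module at a time.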

KristofferC commented 3 weeks ago

So you could for example have a dependency that is only added in some sense if the user requests access to the OrdinaryDiffEq.OrdinaryDiffEqTsit5 submodule.

This sounds fishy to me. How would this look?

ChrisRackauckas commented 3 weeks ago

Maybe a section, somewhat like the extensions section.

[submodules]
Extrapolation = ["NonlinearSolve", "LinearSolve"]

where those are given as weakdeps and only considered dependencies if the downstream package somehow declares that OrdinaryDiffEq.Extrapolation is a dependency as well, maybe a section like

[submoduledeps]
OrdinaryDiffEq = ["Core", "Extrapolation"]

then you can use using OrdinaryDiffEq.Extrapolation. If you don't declare it, you get an error about importing a non-dependency, like the one that exists currently for packages that aren't in the Project.toml.

KristofferC commented 3 weeks ago

So it is a "subpackage" scoped to the "parent package" with its own deps and compat (which would go into the registry) but it always has the same version as the parent? Or is that also separate?

ChrisRackauckas commented 3 weeks ago

So it is a "subpackage" scoped to the "parent package" with its own deps and compat (which would go into the registry) but it always has the same version as the parent?

That's probably a good way to put it. It's a package that's always scoped to be the same version as the parent but can add some extra deps and you can using it for some more functionality.

bvdmitri commented 3 weeks ago

That's probably a good way to put it. It's a package that's always scoped to be the same version as the parent but can add some extra deps and you can using it for some more functionality.

This sounds to me like a small extension package that primarily defines extra dependencies with little to no code, meaning it doesn't require frequent updates. If there is any code, it would be minimal, such as defining the solver and configuration structs, which also wouldn’t need regular updates. All the actual functionality resides in the 'main' package under extensions and is only available (and precompiled) if the user explicitly adds this 'thin' package to their environment.

In this setup, OrdinaryDiffEqBDF would not depend on OrdinaryDiffEqCore, but only on the additional dependencies required for the solver, if any. OrdinaryDiffEq would then include the extension code if OrdinaryDiffEqBDF is present in the environment.

ChrisRackauckas commented 3 weeks ago

All the actual functionality resides in the 'main' package under extensions and is only available (and precompiled) if the user explicitly adds this 'thin' package to their environment.

No, not necessarily. If it was small and thin then it wouldn't affect load time all that much.

I would think in theory if we had all of this functionality and time was easy to come by, we would probably just make all of SciML one big metapackage so that it's all versioned together, and the LinearSolve.jl, NonlinearSolve.jl, etc. would all just be using SciML.NonlinearSolve. Then all of our versioning issues would go away and people would just track the SciML version.

OrdinaryDiffEq.jl solvers are already rather substantial in size in some cases, like OrdinaryDiffEq.StabilizedRK is a few MBs of code. So I wouldn't use things like "small" or "thin" to describe it, the smallness is definitely not a necessary property nor a requirement for it to be useful. That's one use case it might be used for, for example I could see Makie using this structure and then using Makie.Cairo instead of having a separate package CairoMakie. But even in that case, those sublibs are not always small. It's more that it's all versioned together.

bvdmitri commented 3 weeks ago

The term ‘thin’ might have caused some confusion. What I meant is that the few megabytes of code are not in the ‘thin’ package itself, but rather in the main package under extensions. This code would only be available (and precompiled) if the ‘thin’ package is explicitly added to the environment. The ‘thin’ package would essentially be empty, with a Project.toml file that only defines additional dependencies (if any).

nsajko commented 3 weeks ago

I think this ticket conflates several issues that should perhaps be considered separately. E.g., point one:

For one, the registration process is a bit of a nightmare. Any sweeping change requires that we register 30 packages, which is something that must be done in JuliaRegistrator comments [...]

Isn't this more of a tooling/infrastructure issue? NB: FTR: the registrator is also available on the JuliaHub site, maybe that's more convenient than using it on GitHub?

ChrisRackauckas commented 3 weeks ago

I think this ticket conflates several issues that should perhaps be considered separately.

The package infrastructure issue is held in tandem with it because if it was straightforward to keep 50 packages all tied together and updated in lock step then there might not be a need for any submodule features. One solution could be that package infrastructure just gets so much better that registering 50 packages all at the same time is simple enough to be a single command. Right now it's a lot of manual overhead and pretty error prone, so one would consider not using separate packages at all, but that doesn't mean that's the only potential solution.

I leave it as an issue to discuss the general phenomenon, and suggest one path to resolution but leave it open for debate as to whether that's the right path.

Suavesito-Olimpiada commented 3 weeks ago

Maybe a section, somewhat like the extensions section.

[submodules]
Extrapolation = ["NonlinearSolve", "LinearSolve"]

where those are given as weakdeps and only considered dependencies if the downstream package somehow declares that OrdinaryDiffEq.Extrapolation is a dependency as well, maybe a section like

[submoduledeps]
OrdinaryDiffEq = ["Core", "Extrapolation"]

then you can use using OrdinaryDiffEq.Extrapolation. If you don't declare it, you get an error about importing a non-dependency, like the one that exists currently for packages that aren't in the Project.toml.

This all sounds a lot like the [features] support of Cargo.toml (https://doc.rust-lang.org/cargo/reference/features.html). I think its design is pretty functional. Maybe we could look at it for an idea of such a system.

KristofferC commented 3 weeks ago

I would think in theory if we had all of this functionality and time was easy to come by, we would probably just make all of SciML one big metapackage so that it's all versioned together, and the LinearSolve.jl, NonlinearSolve.jl, etc. would all just be using SciML.NonlinearSolve. Then all of our versioning issues would go away and people would just track the SciML version.

But it would also mean that if you wanted to make a small bugfix or docfix to LinearSolve users would have to "bump" the version of SciML and redownload / recompile everything?

And if you only depend on LinearSolve, it would be annoying to have its version get bumped just because of a bugfix in some other "subpackage"?

ChrisRackauckas commented 3 weeks ago

But it would also mean that if you wanted to make a small bugfix or docfix to LinearSolve users would have to "bump" the version of SciML and redownload / recompile everything?

Yes, and I don't know if all of SciML is the right granularity, maybe just DifferentialEquations, I'd have to experiment with it. But definitely for example DelayDiffEq and StochasticDiffEq need isdefined and a matching bump with many OrdinaryDiffEq versions, so those are massive repos but would be perfect to merge into this, and if solver-split on both we'd be at like 70 modules already.

maleadt commented 3 weeks ago

CUDA.jl/cuDNN.jl/cuTENSOR.jl/... is another possible candidate for a feature like this. There's currently quite tight coupling between CUDA.jl and its subpackages, so I wouldn't mind having to bump everything in lockstep (we currently already use tight compat bounds to minimize the risk of incompatibilities, https://github.com/JuliaGPU/CUDA.jl/blob/4918437ad8530bbe6f7d4613af2e82d53e968801/lib/cudnn/Project.toml#L14).

EDIT: another example is CUDA.jl's profiler, which is fairly heavy to load: https://github.com/JuliaGPU/CUDA.jl/issues/2238

cuihantao commented 3 weeks ago

While I don't have the expertise to comment on the diagnosis, I recalled reading the SciML Small Grant project for separating OrdinaryDiffEq into subpackages for improving precompilation. I immediately felt something was off with top-level implementations. One of my packages was subpackaged to speed up the development-time precompilation and then combined again to work around obscure dependency issues. I hope this Issue gets enough attention for a permanent fix, so that regardless of how the author structures it, packages can precompile as fast as the current (manual) best practices.

fingolfin commented 1 week ago

Just to say that I also have a project (not yet in the registry) which has a ton of (sub-)packages in a monorepo. There, however, I'd rather not have the package versions in lockstep. But it sure would be nice if I could issue a single register command and it would release all of the packages in the repo for which the Project.toml has a version newer than what is currently in the registry.

If such a feature would be in principle acceptable, that'd be great.
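
The selection step of such a command could be sketched roughly as follows. This is only an illustration under stated assumptions: `packages_to_release` is a hypothetical name, and both version tables are assumed to have been collected already (one from each Project.toml in the repo, one from the registry).

```julia
# Return the names of packages whose local version is newer than the registered one.
# Packages missing from `registered` are treated as never released.
function packages_to_release(local_versions::Dict{String,VersionNumber},
                             registered::Dict{String,VersionNumber})
    return sort([name for (name, v) in local_versions
                 if !haskey(registered, name) || v > registered[name]])
end
```

Since versions are compared per package, this works without keeping the packages in lockstep.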