theobat commented 9 months ago

Package component level builds

What's the point, what is it ?

Stack uses cabal "simple" (which is very close to the Setup.hs commands, except that it's a binary) to actually build packages. That means that, for each package selected in the "Plan", it gathers all the info required by cabal simple and then calls it. Currently Stack uses cabal simple through package-level builds; that is, for each package it calls:

"configure" without naming a component
"build" with a set of components but it has no effect
"copy" without naming a component ...etc

Component level builds basically do the same as before, but every cabal simple call is targeted at a single component of a package instead, for instance:

configure sublib
build sublib
register sublib
configure exe1
build exe1
copy exe1

This is for a case where we have an exe1 depending on a sublibrary (sublib above). Note that in this case the intra-package dependency has to be handled by Stack, whereas it's currently handled by cabal simple.
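A minimal sketch of how such a per-component call sequence could be generated, assuming the components are already listed in dependency order. The types `CompKind` and `Component` and the functions `setupCalls` and `packageCalls` are hypothetical names for illustration, not Stack's actual code:

```haskell
-- Hypothetical stand-ins; these are not Stack's actual types.
data CompKind = SubLib | Exe
  deriving (Eq, Show)

data Component = Component
  { compKind :: CompKind
  , compName :: String
  }
  deriving (Eq, Show)

-- | The Setup calls for one component. Following the sequence above,
-- a sublibrary ends with "register" and an executable ends with "copy".
setupCalls :: Component -> [[String]]
setupCalls (Component kind name) =
  [ ["configure", name]
  , ["build", name]
  , [finalStep kind, name]
  ]
 where
  finalStep SubLib = "register"
  finalStep Exe = "copy"

-- | All calls for a package whose components are already listed in
-- dependency order (dependencies first).
packageCalls :: [Component] -> [[String]]
packageCalls = concatMap setupCalls
```

For example, `packageCalls [Component SubLib "sublib", Component Exe "exe1"]` yields the six calls listed above, in that order.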

Doing this in Stack land would probably resolve many issues with over-building, but mostly it's a hard requirement for making backpack work (backpack cannot work with current-style builds). I believe that's enough incentive to adopt this new style. Besides, it would also bring Stack closer to the cabal-install CLI.

Some architecture refactoring

In current Stack, we have many occurrences of "Set NamedComponent" or "Map StackUnqualCompName XX". Given the requirements for component based builds, we are going to use a lot more of those, in even more distinct flavors than now, which I don't think will scale well. We also have many occurrences of Library or Executable constructors (see the Installed data type), which again is redundant to some extent. What I propose is that we replace all of these with a few datatypes, a phantom type and a type family, which would encompass all use cases through the same constructors.

First, the core data structures :


data AllLibrary collection (useCase :: UseCase) = AllLibrary
  { dcLibrary :: !(Maybe (CompInfo useCase StackLibrary))
    -- ^ The main library target or information.
  , dcSubLibraries :: !(collection (CompInfo useCase StackLibrary))
    -- ^ The sublibraries target or information.
  }

-- | This subdivision makes sense because it represents "installable components"
data AllLibExe collection (useCase :: UseCase) = AllLibExe
  { icLibrary :: {-# UNPACK #-} !(AllLibrary collection useCase)
    -- ^ The main or sub library target or information.
  , icExecutables :: !(collection (CompInfo useCase StackExecutable))
    -- ^ The executables target or information.
  }

data AllTestBench collection (useCase :: UseCase) = AllTestBench
  { acTestSuites :: !(collection (CompInfo useCase StackTestSuite))
    -- ^ The test suites target or information.
  , acBenchmarks :: !(collection (CompInfo useCase StackBenchmark))
    -- ^ The benchmarks target or information.
  }

-- | A data structure to centralize all aspects of component collections,
-- whether it's a Set, a Map or a CompCollection, or whether you only want component names,
-- it should all use the same data structure.
data AllComponent collection (useCase :: UseCase) = AllComponent
  { acForeignLibraries :: collection (CompInfo useCase StackForeignLibrary)
    -- ^ The foreign libraries target or information.
  , acTestBench :: {-# UNPACK #-} !(AllTestBench collection useCase)
    -- ^ The test suites and benchmarks target or information.
  , acAllLibExe :: {-# UNPACK #-} !(AllLibExe collection useCase)
    -- ^ The libraries and executables target or information.
  }

And then the use case type family :

-- | These are all the use cases for the AllComponent type.
-- This is only meant to be used as an input for the 'CompInfo' type family.
data UseCase
  = JustNames
  -- ^ Sometimes we only need the names of the components,
  | AllCompInfo
  -- ^ Or the entire cabal info that we keep, see the "Stack.Types.Component" module.
  -- In particular packages components are represented as "AllComponent CompCollection AllCompInfo".
  | MissingPresentGhcPkgId
  -- ^ When we construct the plan for building packages, we have to track what's
  -- been installed and what's missing also at the component level.
  | InstalledGhcPkgIdWithLocation
  -- ^ When we retrieve the preexisting info from ghc's package database or the file system,
  -- we want to know for all packages the library data or executable path they have.
  | ModuleFileMap
  -- ^ In GHCi we have to keep track of the module files at the component level.
  | CabalFileMap
  -- ^ In GHCi we have to keep track of the cabal files at the component level.

type family CompInfo (useCase :: UseCase) compType where
  CompInfo JustNames _ = StackUnqualCompName
  CompInfo AllCompInfo compType = compType
  CompInfo MissingPresentGhcPkgId StackLibrary = GhcPkgId
  CompInfo MissingPresentGhcPkgId _ = ()
  CompInfo InstalledGhcPkgIdWithLocation StackLibrary = (InstallLocation, GhcPkgId)
  CompInfo InstalledGhcPkgIdWithLocation StackExecutable = InstallLocation
  CompInfo InstalledGhcPkgIdWithLocation _ = ()
  CompInfo ModuleFileMap _ = Map ModuleName (Path Abs File)
  CompInfo CabalFileMap _ = [DotCabalPath]
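To see how the phantom type and the type family fit together, here is a cut-down, self-contained sketch. It keeps only three use cases and the `AllLibrary` record, and `CompName`, `GhcPkgId` and `StackLibrary` are simplified stand-ins (assumptions of this sketch) for Stack's real types:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Map (Map)
import qualified Data.Map as Map
import Data.Set (Set)
import qualified Data.Set as Set

-- Simplified stand-ins for Stack's real types (assumptions).
newtype CompName = CompName String
  deriving (Eq, Ord, Show)

newtype GhcPkgId = GhcPkgId String
  deriving (Eq, Show)

newtype StackLibrary = StackLibrary { libModules :: [String] }

data UseCase
  = JustNames
  | AllCompInfo
  | MissingPresentGhcPkgId

-- The use case selects the payload stored for each component.
type family CompInfo (useCase :: UseCase) compType where
  CompInfo JustNames _ = CompName
  CompInfo AllCompInfo compType = compType
  CompInfo MissingPresentGhcPkgId StackLibrary = GhcPkgId
  CompInfo MissingPresentGhcPkgId _ = ()

data AllLibrary collection (useCase :: UseCase) = AllLibrary
  { dcLibrary :: !(Maybe (CompInfo useCase StackLibrary))
  , dcSubLibraries :: !(collection (CompInfo useCase StackLibrary))
  }

-- Same record shape, three different payloads and collections.
libNames :: AllLibrary Set JustNames
libNames = AllLibrary
  { dcLibrary = Just (CompName "main")
  , dcSubLibraries = Set.fromList [CompName "sublib1"]
  }

libInfo :: AllLibrary [] AllCompInfo
libInfo = AllLibrary
  { dcLibrary = Just (StackLibrary ["Lib"])
  , dcSubLibraries = []
  }

libIds :: AllLibrary (Map CompName) MissingPresentGhcPkgId
libIds = AllLibrary
  { dcLibrary = Nothing
  , dcSubLibraries = Map.fromList [(CompName "sublib1", GhcPkgId "hypothetical-id")]
  }

-- | Works for any use case: only the keys of the map are inspected.
subLibNames :: AllLibrary (Map CompName) useCase -> Set CompName
subLibNames = Map.keysSet . dcSubLibraries
```

The point of the sketch is that `subLibNames` never needs to know which payload is stored, while `libNames`, `libInfo` and `libIds` are checked against their use case at compile time.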

Now this may appear a bit complicated at first, but there are many benefits to this approach:

In terms of documentation, we can see at first glance what it is that we do at the component level, whereas it's kind of hard to scour the code for all the Set NamedComponent/Map StackUnqualCompName places.
The selection/targeting of components is easier this way; with the current design we have to check the type of NamedComponent before walking through its characteristics in the Package datatype.
We have a finer set construction: it enables type-safe, component-restricted sets (like, give me all the libraries' information == AllLibrary xx yy).

Now let's look at a few examples to see what that would look like in practice:

-- | First, the Package datatype would be refactored to this; arguably we should unpack it:
packageComponents :: !(AllComponent CompCollection AllCompInfo)

-- And of course, we'd provide the equivalent selectors as before : 
packageLibrary = dcLibrary . icLibrary . acAllLibExe . packageComponents
packageSubLibraries = dcSubLibraries . icLibrary . acAllLibExe . packageComponents
packageForeignLibraries = acForeignLibraries . packageComponents
packageTestSuites = acTestSuites . acTestBench . packageComponents
packageBenchmarks = acBenchmarks . acTestBench . packageComponents
packageExecutables = icExecutables . acAllLibExe . packageComponents

Now what about package dependencies? In Cabal, they consist of a set of main library or sublibrary dependencies:

-- To represent this fact we currently have:
data DepLibrary = DepLibrary
  { dlMain :: !Bool
  , dlSublib :: Set StackUnqualCompName
  }
  deriving (Eq, Show)
data DepType
  = AsLibrary !DepLibrary
  | AsBuildTool
  deriving (Eq, Show)

-- That would become : 
data DepType
  = AsLibrary !(AllLibrary Set JustNames)
  | AsBuildTool
  deriving (Eq, Show)

The source files are also mapped for GHCi through a Map keyed by NamedComponent:

-- before : 
data PackageComponentFile = PackageComponentFile
  { modulePathMap :: Map NamedComponent (Map ModuleName (Path Abs File))
  , cabalFileMap :: !(Map NamedComponent [DotCabalPath])
  -- ... etc
  }
-- after : 
data PackageComponentFile = PackageComponentFile
  { modulePathMap :: AllComponent (Map StackUnqualCompName) ModuleFileMap
  , cabalFileMap :: !(AllComponent (Map StackUnqualCompName) CabalFileMap)
  -- ... etc
  }

The InstalledMap datatype, which describes what is installed in the ghc-pkg database, would become:

type InstalledMap = Map PackageName (InstallLocation, Installed)
-- Now things would be a bit finer grained, components in a package can either
-- live in a snapshot or locally : 
type InstalledMap = Map PackageName (AllLibExe (Map StackUnqualCompName) InstalledGhcPkgIdWithLocation)

Now you get it: the design would be more normalized and unified, for a small abstraction cost. It's not strictly necessary to get the component based builds, but I'd say it would make it significantly easier. The idea is to bring in this datatype and then to refactor slowly, step by step, where it makes sense.

The actual task list for the component based builds

Change ConstructPlan to account for component level installed versus to-install GhcPkgId (this would be quite significant).
Resolve intra-package dependencies (for now we don't, we let cabal decide the order of component builds)
Top-sort (probably through an insertion sort though) the package components to build (probably only the library components for now) if more than one is required.
Either subdivide Tasks into smaller parts or only refine task actions. I think for now a good step is to try to do component builds with the one-task-one-package scheme (note that we can already have two tasks per package in case of non-all-in-one builds with tests & benchmarks). That is to say, the first iteration would only bring component-build inside one package task, and then we'd enable a better datatype for task to account for component level aspects.
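As a rough illustration of the top-sort step in the list above, here is a sketch that orders the components of one package so that each comes after its intra-package dependencies. It uses `stronglyConnComp` from `Data.Graph` (containers) rather than the insertion sort mentioned above, and `CompId` and `buildOrder` are hypothetical names:

```haskell
import Data.Graph (SCC (AcyclicSCC), flattenSCC, stronglyConnComp)

-- | Hypothetical component identifier, e.g. "lib:sublib1" or "exe:exe1".
type CompId = String

-- | Order components so that each one comes after the components it
-- depends on. On failure, returns the members of a dependency cycle.
-- Dependencies that are not listed as components are ignored.
buildOrder :: [(CompId, [CompId])] -> Either [CompId] [CompId]
buildOrder deps = traverse acyclic (stronglyConnComp nodes)
 where
  -- stronglyConnComp expects (node, key, keys of the nodes it points to)
  -- and returns the components in reverse topological order, that is,
  -- dependencies first.
  nodes = [(comp, comp, ds) | (comp, ds) <- deps]
  acyclic (AcyclicSCC comp) = Right comp
  acyclic scc = Left (flattenSCC scc)
```

For the example from earlier, `buildOrder [("exe:exe1", ["lib:sublib1"]), ("lib:sublib1", [])]` places the sublibrary before the executable.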

RFC @mpilgrem

Other issues relating to component-based builds

(EDIT by @mpilgrem) The issue/feature request of component-based builds has a long history at this repository. The following are related issues:

mpilgrem commented 9 months ago

@theobat, I can't add much to a discussion about architecture. My own concerns are simple ones: (a) don't break anything for Stack users; (b) don't make Stack slower for 'everyday' use; and (c) keep the code base 'tidy'.

Currently, Stack builds a project using the version of Cabal (the library) that ships with the specified version of GHC - specifically Distribution.Simple.defaultMain (see Stack.Build.Execute.simpleSetupCode) - compiled into a small executable. (Ignoring, for the moment, the complexity of the shim at src/setup-shim/StackSetupShim.hs.) People have asked that Stack supports GHC versions for a long time (seven years - perhaps motivated by the beloved GHC 7.10.3, now no longer supported). Does that affect your plans?

theobat commented 9 months ago

Right, that makes sense. I'll ensure we keep the current existing behavior for builds with older cabals, it should only be a small number of them. Things before cabal 2.2 will be incompatible with this new way of building packages if I recall correctly, but again I'll keep the backward compatibility as a mandatory aspect.

mpilgrem commented 9 months ago

GHC 8.4.1 (released 8 March 2018) comes with Cabal-2.2.0.0, and Stackage LTS Haskell 12.0 (released 9 July 2018) specified GHC 8.4.3/Cabal-2.2.0.1 (bumping from GHC 8.2.2). I'll test again the community's (especially 'industry's') current desire for Stack to support old GHC versions.

EDIT: The 2022 State of Haskell Survey (during November 2022) yielded:

"Which versions of GHC do you use?" (Optional. Multi select.)

Proportion	Count	GHC version
10%	105	> 9.4
26%	265	9.4
48%	496	9.2
25%	262	9.0
41%	428	8.10.x
7%	76	8.8.x
7%	68	8.6.x
3%	35	< 8.6

Also: "Where do you use Haskell?" (Optional. Multi select.)

Proportion	Count	Location
76%	785	Home
49%	504	Industry
18%	192	Academia
7%	70	School

theobat commented 8 months ago

So this turns out to be more complex than I thought, because the entire cache system is geared toward packages. For the sake of limited changes and swiftness, I'm only working on refactoring the inner component builds of an entire package for now, without moving all the bits towards the component architecture.

That is, I'm only moving the singleBuild function in the Execute module to the component world, and that's enough change on its own that any other refactoring would be harmful. It's still likely to facilitate backpack support though.

wraithm commented 5 months ago

It'd be incredible if this feature fixed https://github.com/commercialhaskell/stack/issues/2800

theobat commented 5 months ago

@wraithm it would, and I have had a functional branch with this feature in the past month or so, but the issue is that this architectural change brings a significant perf regression in "normal/traditional" builds because, for each component within a package where you have an internal dependency (e.g. exe depends on lib or lib depends on sub-lib), component based builds means we call the cabal process N times where N is the number of distinct sequential (we can't parallelize them) components. Calling the cabal process is far from negligible, on my machine it incurred a ~~25%~~ 30-40% increase in duration for the integration test suite's execution.

I'm not sure what to do with that. My initial plan was to only trigger the component based builds for backpack builds (which I've been very close to finalizing, and we have no choice as far as backpack is concerned), but I've had too much work in the past few weeks to discuss this issue any further. Maybe @mpilgrem you can dive in on this.

mpilgrem commented 5 months ago

@theobat, thanks for all your work on this and the update. If I understand correctly, it appears that the following are not mutually compatible and 'something has to give':

O1. Stack making use of Cabal (the library) through a compiled Setup.hs executable;
O2. Performance at historical levels for 'everyday' building of packages; and
O3. Stack taking a component-based approach to building as opposed to a package-based approach.

Would I be correct to assume that Cabal (the tool) avoids the problem by not making use of Cabal (the library) through a compiled Setup.hs executable (that is, it 'gives up' O1)? Unlike Stack, each version of Cabal (the tool) uses, essentially, one version of Cabal (the library) (e.g. the dependency of cabal-install-3.10.3.0 is Cabal >= 3.10.3.0 && < 3.11).

The spotlight may be on O1. Why does Stack do that? A few things occur to me:

Is there any alternative, when the build-type: is not Simple and is, for example, Custom? (As an aside, this Haskell Foundation Tech Proposal RFC is that the Cabal project move away from build types other than Simple.)
The Cabal User Guide on Custom setup scripts refers to an article dated 6 July 2015 which, in turn, refers to the Cabal specification - which I think is here. What Stack does is consistent with the original Cabal specification.
Stack prioritises 'reproducible builds'. Elsewhere it has been said that Stack using the version of Cabal (the library) that is the boot package of the specified GHC is consistent with 'reproducibility'. What is not clear to me is whether that is a matter that is specific to packages that use build-type: Custom. I can't see how Stack could make use of GHC's Cabal boot package other than through a compiled executable. If each version of Stack made use of a single version of Cabal (the library), I assume Stack would have to drop versions of GHC when the Cabal project dropped them. (EDIT: For example, Cabal-3.10.1.0, released 13 March 2023, dropped support for GHC < 8.0. Although the master branch version of Stack has dropped support for GHC < 8.4, Stack 2.15.5 still supports GHC >= 7.10.)

theobat commented 5 months ago

That's mostly right @mpilgrem, I don't really know why Stack defers to a subprocess called during Stack's execution. Maybe it was easier back then? And also it means you can use any Cabal library you want... I'm also not entirely sure what cabal the executable does, since I haven't looked at it in depth, but my impression was that the "Simple" build was just using the Cabal library in the same Haskell process, which is indeed what you're describing: it's not doing O1. I don't know if that's a possibility for Stack though... But it would be a significant speedup, and it would significantly reduce the perf difference between package builds and component builds.

Also note that there are significant prospects for getting speed boosts in certain scenarios by using component based builds, even compared to package based builds, but that would require: building only the components we want (as opposed to all the components of a package, but component by component, modulo tests and benchmarks specifics), and building the unrelated components in parallel (as opposed to building only packages in parallel). All these things are yet another stack (sic) of work, and they are not a solution to the problem at hand, that is: building component by component increases the number of subprocesses we need to create to call the Setup.hs file/binary, and these subprocesses are expensive.
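To make the "unrelated components in parallel" idea concrete, here is a small sketch that groups the components of a package into waves, where every component in a wave depends only on components from earlier waves. `CompId` and `buildWaves` are hypothetical names, and dependencies that are not keys of the map are assumed to be already built:

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- | Hypothetical component identifier, e.g. "lib:sublib1" or "exe:exe1".
type CompId = String

-- | Group components into waves. All components of one wave could be
-- built in parallel, because none of them depends on a component that
-- is still pending. If a dependency cycle prevents progress, the
-- remaining components are dropped from the result.
buildWaves :: Map CompId [CompId] -> [[CompId]]
buildWaves pending
  | Map.null pending = []
  | null ready = []
  | otherwise = ready : buildWaves remaining
 where
  -- A component is ready when none of its dependencies is still pending.
  ready =
    [ comp
    | (comp, ds) <- Map.toList pending
    , all (`Map.notMember` pending) ds
    ]
  remaining = Map.filterWithKey (\comp _ -> comp `notElem` ready) pending
```

With a sublibrary, a main library depending on it, and two executables (one depending on each library), the sublibrary forms the first wave, the main library and one executable the second, and the last executable the third.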

mpilgrem commented 5 months ago

This 9 Feb 2015 article by Michael Snoyman is referred to in the 6 July 2015 article I mention above. To put them in their historical context, Stack 0.0.1 was released on 9 June 2015. I am wondering if his experience is the origin of the 'reproducibility' explanation I had read for 'O1'.

mpilgrem commented 5 months ago

A thought experiment: imagine a package that has build-type: Simple and lts-12.0 is specified (GHC 8.4.3, Cabal-2.2.0.1). Is there really a problem with 'reproducibility' if it is built with (a) Distribution.Simple.defaultMain from Cabal-3.10.3.0 (say) rather than (b) Distribution.Simple.defaultMain from Cabal-2.2.0.1 (via a compiled executable)?

wraithm commented 5 months ago

> A thought experiment: imagine a package that has build-type: Simple and lts-12.0 is specified (GHC 8.4.3, Cabal-2.2.0.1). Is there really a problem with 'reproducibility' if it is built with (a) Distribution.Simple.defaultMain from Cabal-3.10.3.0 (say) rather than (b) Distribution.Simple.defaultMain from Cabal-2.2.0.1 (via a compiled executable)?

It sounds to me like the answer to that question is: yes, there is a problem with some strict definition of reproducibility, eg. Cabal can interpret fields in the package differently across versions, etc. However, maybe there's a deeper question of, "is this a real problem?" I imagine that if we just bundled a single version of the cabal library, 99.99% of things would just work most of the time. I'm sure there are some pathological examples you could come up with. It might be interesting to fully understand what caused those major and minor version changes in Cabal.

I could imagine just calling Distribution.Simple.defaultMain (or something very close) only for build-type: Simple inside of the stack exec itself, as a function call, rather than an external process. Maybe I don't understand fully the implications there. That would make it way faster, no? Of course, you would have to build Setup.hs for custom things, but I imagine that's a relatively infrequent case.

You could also conceivably imagine bundling multiple different versions of the Cabal library and calling those different library functions based on the compiler version or what have you. However, I imagine that's way too much complexity for stack to handle for something that's of nebulous benefit.

Here's maybe another question: What does cabal-install do here? The whole "gotta build and shell out to the Setup.hs exec" thing seems to me like a problem that cabal-install would also have.

IMHO, compilation speed is way way more important than stack handling all possible reproducibility cases. You can still handle reproducibility issues by just using different versions of stack in this world where stack uses one version of Cabal. Maybe my understanding is way off, but that's how I view this.

Just curious, @theobat, where does the constraint that you need to do N invocations of cabal per component come from? Is this a fundamental limitation in the Cabal library or is this constraint coming from these peculiarities of how stack is caching and building things (or something else entirely that I'm not seeing)?

theobat commented 4 months ago

> I could imagine just calling Distribution.Simple.defaultMain (or something very close) only for build-type: Simple inside of the stack exec itself, as a function call, rather than an external process. Maybe I don't understand fully the implications there. That would make it way faster, no? Of course, you would have to build Setup.hs for custom things, but I imagine that's a relatively infrequent case.

Yes, that would be nice, but I think carefully removing the historical way of deferring to a subprocess is far from obvious; there's a LOT of code just to handle the ceremony of doing these calls correctly. I suppose a nice approach to this would be to try it just for component based builds, and keep the old way behind a flag (true or false by default, I don't know).

> IMHO, compilation speed is way way more important than stack handling all possible reproducibility cases. You can still handle reproducibility issues by just using different versions of stack in this world where stack uses one version of Cabal. Maybe my understanding is way off, but that's how I view this.

Yes, at least there should be a way to move the cursor of the current reproducibility/perf tradeoff.

> Just curious, @theobat, where does the constraint that you need to do N invocations of cabal per component come from? Is this a fundamental limitation in the Cabal library or is this constraint coming from these peculiarities of how stack is caching and building things (or something else entirely that I'm not seeing)?

@wraithm I didn't make myself very clear : we only need to call the setup, build, configure etc, once per component. That's simply a requirement of cabal's own setup.hs interface. In particular, this paragraph makes it very clear :

> In Cabal 2.0, support for a single positional argument was added to runhaskell Setup.hs configure. This makes Cabal configure the specific component to be configured. Specified names can be qualified with lib: or exe: in case just a name is ambiguous (as would be the case for a package named p which has a library and an executable named p.) This has the following effects:
>
> Subsequent invocations of cabal build, register, etc. operate only on the configured component.

And the constant cost of calling these scripts is roughly the same no matter if it concerns a single component or a whole package. So we simply pay a (N - 1) * Constant factor additional cost by building per components instead of per packages, where N is the number of components of a package. Note that we also benefit from (very small, I'm afraid) gains by only building the libraries in package's dependencies of a final target. It's a little hard to assess the real-world impact of this performance regression, because integration tests are doing a lot of builds sequentially. So it's some kind of "worst case scenario". But still, I'm really not comfortable with pushing a 30-40% increase on integration tests run time.
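The (N - 1) * Constant estimate can be written down as a one-line function. This is only a restatement of the estimate above with a hypothetical name (`extraOverhead`), and the numbers in the usage note are made up for illustration, not measurements:

```haskell
-- | Extra fixed overhead of building a package component by component
-- instead of as a whole: each of the N components pays the constant
-- per-invocation cost once, where a package build pays it once in total.
extraOverhead
  :: Int     -- ^ N, the number of components built for the package
  -> Double  -- ^ constant cost of one round of Setup calls, in seconds
  -> Double  -- ^ additional seconds compared with a package build
extraOverhead n constant = fromIntegral (max 0 (n - 1)) * constant
```

With an illustrative constant of 1.5 seconds, a package with 4 components would pay `extraOverhead 4 1.5`, that is 4.5 extra seconds, while a single-component package pays nothing extra.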

mpilgrem commented 3 months ago

I'll look into the history of Stack building (by default) with the version of Cabal (the library) that comes with the specified version of GHC as a boot library. I think that, in order to do so, Stack necessarily has to compile a separate 'Setup' executable for each GHC/Cabal combo. I am also aware that was how the original specification of Cabal - https://www.haskell.org/cabal/proposal/pkg-spec.pdf (page 3) - intended Cabal to be used.

If there was any plan to move away from that, you would have to be convinced that it did not break anything for users of GHC 8.4 onwards or adversely affect the reproducibility of builds.

theobat commented 2 weeks ago

@wraithm @mpilgrem FYI, I found the cabal logic related to deferring to a subprocess or not : https://github.com/haskell/cabal/blob/master/cabal-install/src/Distribution/Client/SetupWrapper.hs#L401-L426.

It seems indeed that they use an internal method for all the Simple builds (except for some special logging aspect), deferring to the Cabal library's defaultMainArgs function within the same process. I'm still trying to fathom how this impacts reproducibility, but as far as I understand for now: as long as the cabal-version range written in the cabal file is respected, doing an internal call with the library bundled with Stack should be entirely fine... And this should happen in the vast majority of cases...
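A minimal sketch of what such a split could look like on the Stack side: call `defaultMainArgs` from `Distribution.Simple` in-process for Simple builds, and fall back to a compiled Setup executable otherwise. `BuildKind`, `SetupStrategy`, `setupStrategy` and `runSetup` are hypothetical names, and this is not how Stack or cabal-install is actually structured. A real implementation would also have to deal with the working directory, exit codes, logging and the cabal-version check discussed above:

```haskell
import Distribution.Simple (defaultMainArgs)
import System.Process (callProcess)

-- | Stand-in for the package's build-type (an assumption of this sketch;
-- Cabal has its own BuildType with more alternatives).
data BuildKind = SimpleBuild | CustomBuild
  deriving (Eq, Show)

data SetupStrategy = InProcess | ExternalSetup
  deriving (Eq, Show)

-- | Only Simple builds are safe to run with the bundled Cabal library.
setupStrategy :: BuildKind -> SetupStrategy
setupStrategy SimpleBuild = InProcess
setupStrategy CustomBuild = ExternalSetup

-- | Run one Setup command, e.g. ["build", "exe:exe1"], in the current
-- working directory. 'setupExe' is the path to a previously compiled
-- Setup executable, used only for non-Simple builds.
runSetup :: FilePath -> BuildKind -> [String] -> IO ()
runSetup setupExe kind args =
  case setupStrategy kind of
    -- Same process, using the Cabal library this program was built with.
    InProcess -> defaultMainArgs args
    -- Separate process, using whatever Cabal the executable was built with.
    ExternalSetup -> callProcess setupExe args
```

The design choice to note is that the in-process branch always uses the single Cabal version linked into the tool, which is exactly the reproducibility question raised earlier in the thread.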

mpilgrem commented 2 weeks ago

@theobat, many thanks for continuing giving this topic your attention. I'll have a look at what you've found.

commercialhaskell / stack

Component-based builds #6356
