Refactoring proposal : yet another move towards cabal

commercialhaskell / stack

The Haskell Tool Stack

http://haskellstack.org

BSD 3-Clause "New" or "Revised" License

4k stars 843 forks source link

Refactoring proposal : yet another move towards cabal #6593

Open theobat opened 5 months ago

theobat commented 5 months ago

So, while working on backpack related processing, I realized that there are many things that stack duplicate in very slightly different ways from cabal, with no obvious benefit and pretty obvious downsides (such as conversion costs). I've come to a point where I either have to duplicate yet another huge chunk of backpack related logic into Stack because the types differ, or I can try to refactor current duplicates from Stack which make no sense. This would benefit Stack by reducing the amount of conversion and parsing logic done on stack's side (but some conversation will still be needed at places but I'd expect far less than the current state of things). This should also reduce the amount of code in stack at the cost of slightly hampered flexibility and maintenance burden for the imported logic. This issue is a request for comment for the following refactorings (That I would do if I'm allowed):

The entire DumpPackage type and parsing logic is a duplicate of the InstalledPackageInfo in cabal. The parsing function in cabal is expecting strict ByteString whereas the stack one operates on a stream of Text, but I'd expect the transition to be relatively smooth.
The GhcPkgId is the mix of many different types in Cabal (but mostly, as of now in stack, it's the mirror of the ComponentId). This is a problem for Backpack because Backpack has a notion of OpenModule (that is component which lacks implementation of its signatures) and this notion is hard to represent correctly with current GhcPkgId type.
StackUnqualComponentName should be a newtype over UnqualComponentName from cabal, there are little reasons for not having this and the current Text based version is less efficient than the cabal one.
The NamedComponent from Stack should also use the StackUnqualComponentName as the underlying representation for better interop. A Text translating function can be implemented to limit the amount of refactoring needed.

That's it, for now, I believe this is a significant enough refactoring. It would really ease the process for Backpack support.

PS: Not that while Backpack requires component based builds and they incur a significant slowdown compared to package based builds (as explained in #6356), it shouldn't be a problem for now because we'd only enable component based build for backpack packages.

RFC @mpilgrem

mpilgrem commented 5 months ago

A few questions:

Does DumpPackage exist because Stack supports a wider range of versions of ghc-pkg (and the output of ghc-pkg describe) than a single version of Cabal (the library)? Cabal says that it only supports GHC versions (and, presumably, by implication, ghc-pkg) within a five-year window. (I've not looked into the history of the InstalledPackageInfo data constructor.) My concern is that Cabal drops support for a version of ghc-pkg that Stack supports, and then we have to revert to Stack having its own type/logic.
InstalledPackageInfo has a number of fields that DumpPackage currently drops (EDT: that is, the DumpPackage type does not represent the full information provided by ghc-pkg describe, only a subset of it). If they remain redundant for Stack's purposes, is there an adverse performance aspect of using InstalledPackageInfo?
Type GhcPkgId (a newtype wrapper around Text) represents values of the id field in *.conf files. I also understand (EDIT: incorrectly, see below) that is represented by type ComponentId (a newtype wrapper around ShortText from Cabal-syntax) in Cabal-Syntax-3.10.3.0. So, I did not follow what you were proposing if both are a problem for Backback. Do you need a new type, representing something different? (Aside: the meaning of the id field is documented here: https://downloads.haskell.org/ghc/latest/docs/users_guide/packages.html#installedpackageinfo-a-package-specification)

EDIT: I now understand that the ghc-pkg describe id field is parsed as Cabal-syntax's installedUnitId :: UnitId.

theobat commented 5 months ago

I don't know, but it seems to me that the changes that occured in ghc-pkg between say, ghc 8.0 and now are minimal. As far as I know it's about adding new fields.
Precisely, and I need some of these additional fields. Now, some of the fields are kept lazy in cabal, plus the overall format is a lot more memory efficient (short text) so I suppose it revolves around the parser's performance. Cabal uses Parsec, whereas Stack use a custom hand written parser, I don't which one is more performant.
So, stack uses GhcPkgId in many places, and particularly in places where we inform the package-id of installed packages to build other packages that need them. This is a place where cabal uses a UnitId, because the id of a component contains its instantiation information.

I suppose the real issue lies in your first question, a more principled approach would be to copy a subset of the cabal InstalledPackageInfo (with the same fields and underlying types, somewhaat like I did for component types) and to copy the parser and check that it still works the way we expect it to work (and thus we can add our specific logics should the need arise, to support unsupported versions by cabal). The need I have is rather to find certain fields present in the InstalledPackageInfo with a certain type (which should not pose any problem with backward compatibility), instead of the DumpPackage one which lacks some of them or parse them in the wrong format (Text).

andreabedini commented 5 months ago

Allow me a drive-by comment, I always found the situation around InstalledPackageInfo a bit convoluted. We need to remember that cabal was a common interface shared between different Haskell compilers. The definition of InstalledPackageInfo is indeed in Cabal-syntax and ghc-pkg uses that definition for its implementation (de/serialisation) [^1].

I find it unlikely that Cabal would change InstalledPackageInfo in a way to not be able to read old InstalledPackageInfo files (which is what ghc-pkg would output). It should be possible to use the ghc-provided Cabal to read such files, or use a lover-lever parser in Cabal-syntax.

I don't know much about how things are implemented in Stack but I'd be happy to answer questions on the Cabal side of things.

[^1]: It turns out that ghc-pkg does not care about most of the fields in InstalledPackageInfo so it makes an extract in its own format (which is what package.cache is).

mpilgrem commented 5 months ago

data DumpPackage (and conduitDumpPackage) is ancient code:

https://github.com/commercialhaskell/stack/commit/7d94335aef919ac43822c4d5321771b40cfdb8da (2 June 2015)

mpilgrem commented 5 months ago

Currently, conduitDumpPackage makes use of the following ghc-pkg describe fields (the corresponding Cabal-syntax-3.12.0.0 fields in brackets):

name and version (sourcePackageId :: PackageId)
id (installedUnitId :: UnitId - Stack uses GhcPkgId†)
licence (license :: Either License License)
exposed (exposed :: Bool)
exposed-modules (exposedModules :: [ExposedModule] - Stack uses a Set)
library-dirs (libraryDirs :: [FilePath])
hs-libraries (hsLibraries :: [String] - Stack uses [Text])
depends (depends :: [UnitId] - Stack uses [GhcPkgId]†)
haddock-interfaces (haddockInterfaces :: [FilePath])
haddock-html (haddockHTMLs :: [FilePath] - Stack uses Maybe FilePath, because there is, at most, one filepath)
package-name (undocumented in the GHC guide? - Used to indicate a sublibrary)
lib-name (undocumented in the GHC guide? - Used in connection with sublibraries) (sourceLibName :: LibraryName)

† In Cabal-1.22.0.0: installedPackageId :: InstalledPackageId and depends :: [InstalledPackageId]. In Cabal-1.24.0.0: installedUnitId :: UnitId and depends :: [UnitId]. In Cabal-2.2.0.0: installedComponentId_ :: ComponentId, installedUnitId :: UnitId and depends :: [UnitId].

mpilgrem commented 5 months ago

Stack parses the output of ghc-pkg describe into its sections (one for each package), then each section into its key: value pairs, and then parses (if it needs to) the values of relevant keys. I think Stack is assuming that if ghc-pkg describe produces output, it is valid output.

Cabal-syntax has Distribution.InstalledPackageInfo.parseInstalledPackageInfo :: ByteString -> Either (NonEmpty String) ([String], InstalledPackageInfo) which handles one section only and is more elaborate (with its handling of errors and warnings). Cabal-syntax is not assuming that the ByteString is valid.

So, two questions:

is the performance 'cost' of the 'bells and whistles' of (untrusting) parseInstalledPackageInfo worth it, in a Stack context; and
should the irrelevant parts of a parsed InstalledPackageInfo value be dropped and then a 'lighter' (modified) DumpPackage type continue to be used?

mpilgrem commented 5 months ago

@theobat, personally I am fine with refactoring as long as it "does not break anything" ... but I would consider a noticable regression in performance for 'everyday users of Stack' to be "breaking something". For that reason, I would take things 'step by step' where possible, rather than put through something 'monolithic' all in one go.

The reason I am fine with refactoring is Stack's library exists only to serve its executable; Stack makes no promises about the stability of its library's API.

EDIT2: Also, the Stack project seeks to make use of Cabal types where it can; it does not seek to reinvent the wheel for its own sake, but only when there is a good reason for doing so.

EDIT1: One last thought: historically, the Stack project has tried hard to avoid String when it is not an appropriate type. I would be reluctant to undo Stack's attempts to eliminate String unless there was a compelling reason to re-introduce it in a particular context.

theobat commented 5 months ago

@mpilgrem roger that, I'll go down the "keep and refactor the DumpPackage" to get it closer to InstalledPackageInfofor now.