commercialhaskell / stack

The Haskell Tool Stack
http://haskellstack.org
BSD 3-Clause "New" or "Revised" License

Support concurrent stack invocations or document current behavior #643

Closed: rrnewton closed this issue 9 years ago

rrnewton commented 9 years ago

I've searched for this with strings like "haskell stack tool" "concurrent invocation", but haven't yet found anything mentioning it (instead I get all the search results about parallel programming in Haskell).

My understanding is that cabal's behavior is currently undefined when multiple processes are invoked at the same time (no lock files; see haskell/cabal#2576). But it would be nice if we could fix this at the stack level.

The scenario is that a continuous integration system launches many stack build commands on the same user account. With cabal sandboxes, this is ok, because they don't modify any global package dbs. However, with stack it seems like it may be unsafe, because two different projects using the same snapshot can both attempt to install into the user-wide snapshot database.

Nix, for example, uses lock files on disk to resolve races between concurrent invocations. With stack, would it be sufficient to simply put a lock around each snapshot package db?

snoyberg commented 9 years ago

Currently there is no explicit support for multiple invocations of stack. I can think of two different approaches we can take:

  1. Lock individual actions like the register step
  2. Prevent two concurrent builds from occurring at all

I can imagine (2) becoming very frustrating for users, whereas (1) won't prevent some breakage from occurring. Is there a more invasive (3) that you were thinking of?

borsboom commented 9 years ago

Would it make sense to lock a whole snapshot (e.g., only one stack invocation can operate in ~/.stack/snapshots/x86_64-osx/lts-2.17/7.8.4/ at a time)? The "second" invocation could print a warning saying it's waiting for the lock, and then just wait until the original invocation is done before continuing. If stack doesn't have to write to the snapshot, then no locking is needed.

And we can just say behaviour for running two stacks in the same project is undefined.

borsboom commented 9 years ago

Clarification: I'm assuming it's safe for stack to read from a snapshot that another invocation is writing to, so I'm talking about a shared (non-exclusive) lock.
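
As a rough illustration of the "warn, then wait" idea, using the filelock package that comes up later in this thread (withSnapshotLock and the lock-file location are made-up names for the sketch, not anything stack actually has):

    import Control.Exception (bracket)
    import System.FileLock (SharedExclusive (..), lockFile, tryLockFile, unlockFile)

    -- Hypothetical helper: take a per-snapshot lock file, warning if another
    -- stack invocation already holds it in a conflicting mode.
    withSnapshotLock :: FilePath -> SharedExclusive -> IO a -> IO a
    withSnapshotLock snapshotDir mode action =
        bracket acquire unlockFile (const action)
      where
        lockPath = snapshotDir ++ "/.stack-lock"   -- made-up location
        acquire = do
            mlock <- tryLockFile lockPath mode
            case mlock of
                Just l  -> return l
                Nothing -> do
                    putStrLn "Waiting for another stack invocation to release the snapshot lock..."
                    lockFile lockPath mode         -- blocks until the lock is available

A reader would pass Shared here and a writer Exclusive, so concurrent readers never wait for each other.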

3noch commented 9 years ago

@manny-fp I don't know of any locking model that allows readers to read while the data is in flux. RW/shared locks allow multiple readers but writers are always exclusive, IIRC.

Regardless, my gut sense is that package-level locks would be best. That would allow two diverging instances to eventually build different packages in the same snapshot. But I'm not sure whether that's easy (or even possible) given how the build currently works.

FWIW, I've had great success with the filelock package on Windows (it's cross-platform). It uses native Windows APIs to avoid the need for polling (which some alternatives require), and the lock is always released on process death. I can't say how well it works on other systems.

I imagine that you'd protect any write operation with a file lock.
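
For what it's worth, the bracketed entry point of that package is about as small as it gets; a minimal sketch, with the lock path and the write itself as placeholders:

    import System.FileLock (SharedExclusive (..), withFileLock)

    -- Hold an exclusive lock for the duration of the write; the lock is
    -- released when the action finishes or throws.
    main :: IO ()
    main = withFileLock "/tmp/stack-example.lock" Exclusive $ \_lock ->
        appendFile "/tmp/stack-example.log" "did some registration work\n"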

rrnewton commented 9 years ago

I guess my main frustration is with command line tools (like cabal) that simply don't explain to users what their concurrent behavior is. Users can then get into the bad habit of using them concurrently (in scripts, in CI, etc.), and it "seems to work", but they're actually doing something incorrect that will fail nondeterministically. For example, is "cabal update" transactional? I don't know. We have a CI job that runs cabal update periodically, but it doesn't ensure that cabal installs (in sandboxes) aren't happening on the same machine.

So coming from that background, I would love it if tools which don't support concurrency would just grab a global lock (@snoyberg's option 2) and bail out saying "you can't do this".

Improving on that, I was thinking of exactly what @manny-fp said about locking the (shared) snapshots. Ideally this would use a RW lock so multiple readers can run concurrently. But @snoyberg suggests that there would still be "some breakage". What's an example of that breakage?

I guess one question is whether a set of concurrent stack install operations can have serializable semantics. I need to learn more about the package db data structures on disk, but it seems like this may in fact be possible. In particular, stack doesn't expose a delete operation, so all operations on shared snapshots are monotonic and should commute, right? (That is, the snapshot monotonically gets closer to, e.g., "all of LTS-2.19 installed".) So even if the locking is "medium grained" -- around each snapshot, but grabbing the lock once per package install rather than once for the entire stack execution -- linearizability should still be possible. That is, if one stack instance installs {A,B,C} in a snapshot and another installs {A,X,Y}, any series of lock acquisitions and any order of installs by the two instances should result in the same final state on disk. And that final state should be equivalent to what we'd get if each stack instance held the lock for its entire execution, forcing one to run completely before the other.
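
To make the commutativity claim concrete, here is a toy model (not how stack actually represents a snapshot): if a snapshot is just a set of installed package IDs and an install is a set insertion, any interleaving of the two instances' installs produces the same final set.

    import qualified Data.Set as Set

    type Snapshot = Set.Set String

    install :: String -> Snapshot -> Snapshot
    install = Set.insert

    planA, planB :: [String]
    planA = ["A", "B", "C"]
    planB = ["A", "X", "Y"]

    main :: IO ()
    main = do
        let run = foldr install Set.empty
            -- Two of the many possible interleavings of the two plans.
            interleaved1 = planA ++ planB
            interleaved2 = concat (zipWith (\x y -> [x, y]) planB planA)
        print (run interleaved1 == run interleaved2)  -- True: order doesn't matter

Of course, this only holds if an "install" really is a pure, deterministic insertion, which is exactly the assumption challenged below.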

snoyberg commented 9 years ago

Things are far uglier than they seem at first, which is why anything short of "enforce that only a single stack build can occur at once" will likely break. For example: building the exact same package with the exact same dependencies twice may result in different package IDs, and when registering, Cabal will implicitly unregister whatever already exists, breaking previous install plans. In other words: if stack process A comes up with a build plan based on state S1, process B comes up with another build plan also based on S1, process A takes the lock and builds, and then process B takes the lock and builds, then process B may end up breaking things that process A did.

rrnewton commented 9 years ago

Ah yes, the (terribly embarrassing) nondeterministic, irreproducible builds...

Ok, that would seem to argue for coarser-grained locking. Each stack process could do the following (sketched below):

  1. Grab the lock in Read mode.
  2. Read the installed package set, and come up with a build plan.
  3. If the build plan does not involve installs into a shared snapshot, proceed with it. Otherwise, release the lock and grab it again in Write mode.
  4. Replan if the installed package set has changed, or just replan unconditionally after the write lock is acquired.
  5. Execute the new build plan, mutating the snapshot.
  6. Release the write lock.

Any problems with that approach?
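
A rough sketch of those six steps, again assuming the filelock package; Plan, makePlan, and execute are stand-ins for stack's real planning and execution code, not its actual API:

    import System.FileLock (SharedExclusive (..), lockFile, unlockFile)

    -- Hypothetical stand-in for stack's real build plan type.
    data Plan = Plan { planWritesSnapshot :: Bool }

    buildWithSnapshotLock :: FilePath -> IO Plan -> (Plan -> IO ()) -> IO ()
    buildWithSnapshotLock lockPath makePlan execute = do
        readLock <- lockFile lockPath Shared             -- 1. grab the lock in Read mode
        plan <- makePlan                                 -- 2. read installed packages, plan
        if not (planWritesSnapshot plan)
            then do execute plan                         -- 3. no snapshot writes: just proceed
                    unlockFile readLock
            else do unlockFile readLock                  -- 3. otherwise drop the read lock...
                    writeLock <- lockFile lockPath Exclusive
                    plan' <- makePlan                    -- 4. replan under the write lock
                    execute plan'                        -- 5. execute, mutating the snapshot
                    unlockFile writeLock                 -- 6. release the write lock

Exception handling is omitted; real code would use bracket (or withFileLock) so the lock is released even if planning or the build throws.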

rrnewton commented 9 years ago

(There are ways to make it work better under contention -- like, using the same principle as "elimination trees", blocked processes could communicate the packages they want installed to the privileged process that holds the write lock -- but the simple approach should be good enough.)

3noch commented 9 years ago

Ouch...I didn't realize package hashes were not 100% deterministic (at least, at this level). I'm glad this is being dealt with: anything is better than brokenness.

snoyberg commented 9 years ago

@rrnewton I don't see a problem with that approach. I'd recommend we start off with this kind of locking as an optional feature, and decide later if it should be made the default. Does that make sense? And do you have enough information to take a stab at it?

rrnewton commented 9 years ago

Well, I'll probably end up asking a ton of questions somewhere ;-) (here or mailing list). To start, is there a phase distinction between installs into snapshot/shared vs. local?

I'll start by testing out the existing file-lock packages; posix-filelock seemed broken right off the bat (it let me take TWO write locks at the same time... hmm).

3noch commented 9 years ago

@rrnewton I can vouch for filelock on Windows.

rrnewton commented 9 years ago

filelock seems to work fine on Mac OS with some cursory testing.

On the filelock branch on my fork I started marking some places where the locks could happen.

It looks like this may be really easy if the modifications only occur in the build function in Build.hs.

How much finer-grained the locking could go without substantial refactoring, I'm not yet sure. (Reading top-down, it's a little hard to tell where all the file system effects are, because lots of code here lives in the IO monad.)

First, on the reader lock: there are a whole bunch of places that call taggedDecodeOrLoad and read data. It's probably better to lock around at least the whole loadSourceMap call. Looking at the guts of build below, I'm not sure whether the actions between loadSourceMap and preFetch perform any reads (or writes) to disk.

    menv <- getMinimalEnvOverride

    -- FILELOCK: a coarse-grained option would be to lock around this whole action:
    (mbp, locals, extraToBuild, sourceMap) <- loadSourceMap bopts

    -- Set local files, necessary for file watching
    stackYaml <- asks $ bcStackYaml . getBuildConfig
    liftIO $ setLocalFiles
           $ Set.insert stackYaml
           $ Set.unions
           $ map lpFiles locals

    (installedMap, locallyRegistered) <-
        getInstalled menv
                     GetInstalledOpts
                         { getInstalledProfiling = profiling
                         , getInstalledHaddock   = shouldHaddockDeps bopts }
                     sourceMap

    baseConfigOpts <- mkBaseConfigOpts bopts
    plan <- withLoadPackage menv $ \loadPackage ->
        constructPlan mbp baseConfigOpts locals extraToBuild locallyRegistered loadPackage sourceMap installedMap

    -- FILELOCK: does prefetch store in shared space?
    when (boptsPreFetch bopts) $
        preFetch plan

    -- FILELOCK: here we could release the reader lock, and test if we need the writer lock.
    if boptsDryrun bopts
        then printPlan (boptsFinalAction bopts) plan
        else executePlan menv bopts baseConfigOpts locals sourceMap plan

Currently executePlan interleaves local and snapshot installs, so there doesn't seem to be a good way to "finish all snapshot installs, then release lock". I was hoping it would be possible to unlock during most of the local work.

Further, the caching scheme would seem to interfere with our attempt to use reader/writer locks. taggedDecodeOrLoad in turn calls taggedEncodeFile, which writes to disk and is not atomic. It could be made atomic by writing to a temporary file and moving it into place, but that would require that all the cached data (minibuildplans, etc.) have deterministic contents and that races between concurrent processes are ok.
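
The write-then-rename trick could look roughly like this (a sketch only; atomicWriteCache is a made-up name, and the rename is only atomic when the temporary file lives on the same filesystem as the target):

    import qualified Data.ByteString as BS
    import System.Directory (renameFile)
    import System.FilePath ((<.>))

    -- Write the cache contents to a sibling temporary file, then move it into
    -- place so concurrent readers never observe a half-written file.
    atomicWriteCache :: FilePath -> BS.ByteString -> IO ()
    atomicWriteCache path contents = do
        let tmp = path <.> "tmp"   -- same directory, so the rename can replace in place
        BS.writeFile tmp contents
        renameFile tmp path

A real version would also want a process-unique temporary name (e.g. from openTempFile) so two concurrent writers don't collide on the same .tmp file.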

So... the short-term solution looks to be to just take an exclusive (write) lock around the entire build action. But the other actions would need to be considered as well; "update" and "upgrade" certainly write to shared areas of disk.

rrnewton commented 9 years ago

Update: this commit contains a simple version as a starting point.

It seems to work. I've been testing it by concurrently performing stack actions, e.g. (stack foo &); stack foo, and making sure things work ok. (And the tests pass, of course.)

Michael, how would you like it parameterized? It doesn't seem worth a command line flag if it will ultimately be always on. (That is, I don't think it's a desired feature to allow unsafe concurrent commands with dubious properties.) Maybe a config option in the global yaml file?

3noch commented 9 years ago

Yes, I'm not sure stack --allow-breakage would be an attractive option, nor would stack --dont-explode solicit understanding from end users. :wink:

rrnewton commented 9 years ago

Moving discussion over to the pull request (#670) and closing this issue.

3noch commented 9 years ago

@rrnewton Always great to meet a fellow Hoosier. :)