com-lihaoyi / mill

Mill is a fast JVM build tool that supports Java and Scala. 2-3x faster than Gradle and 5-10x faster than Maven for common workflows, Mill aims to make your project’s build process performant, maintainable, and flexible
https://mill-build.org/
MIT License
2.03k stars 320 forks source link

Reuse out folder among different machines/directories #2101

Open sequencer opened 1 year ago

sequencer commented 1 year ago

I’m using mill to manage a relative big system. A usage is forking entire workspace , including 'out', sending to other machines, executed some thing else, fork and send again and again. This is used for distributing works to different nodes. however due to the absolute path in PathRef, out cannot be directly packaged and sent away. Thus I’m proposing to make the path in PathRef relative to T.ctx.workspace to reduce the dependency to environment out of build context.

lefou commented 1 year ago

We're open for discussion and development to support such scenarios. We discussed some aspects of it in the past and I'd like to reference it here FYI:

lefou commented 1 year ago

Although mill.api.PathRef is used in targets, it's not necessarily limited to it. Also, it is not limited to paths below T.workspace, so simply changing to a relative path may not work. I already wrote up an idea to let PathRefs support relative paths, which may work (see https://github.com/com-lihaoyi/mill/discussions/1400#discussioncomment-978572), but it was under the context, that we actively distribute work to other nodes. This is a bit different from your use case.

sequencer commented 1 year ago

Yes I just checked those infos befor submitting this issue. Let me know what I can contribute to make this happening.

sequencer commented 1 year ago

Actually, for the build script trigger reevaluating issue, I wonder if it’s possible to manually give each Module a version to reduce this behavior, which may give the chance to throw rebuild burden to users.

CircuitCoder commented 1 year ago

Maybe another idea:

When walking the directory subtree rooted at p (the argument path in PathRef.apply), use each file's relative path to p to update the digest.

The reasoning is that if two directory tree has the same shape and the same content, we should consider them the "same" directory in the sense of caching. In this implementation, moving the entire subtree p to another location won't change its hash.

sequencer commented 1 year ago

another(maybe unrelative) question is: Is possible to get the full reason why a target is reevaluated, I thought when input is same caching should work, but have no idea to know why input is changed in some complex cases.

lefou commented 1 year ago

Well, currently PathRef is some kind of blackbox. Once there is one bit different, it's completely different. We probably could model that differently (e.g. a tree-like structure, so we also detect changes in parts of it or can re-use sub-parts) , but I think it's probably not efficient enough. Both, speed and memory wise.

So splitting it structurally on the smallest part might be overkill. But as @sequencer suggested, just keeping the info stable relative to T.workspace (if possible) might already be good enough, or at least a good start. We could also try to keep the path-info separated from the content hash, so that the proof whether two PathRefs are of equal content is possible, even when the containing path is different. That's probably what @CircuitCoder meant? In Mill, we are interested in both information, but not necessarily so closely coupled.

I think some experiments or creating a POC wouldn't hurt and could be fun. Got for it!

About the other issue, tracking the origin of a change, this should be probably discussed separately. But once, we invent some tracking, the whole user experience might change. And I fear not to the better. It might result in more complicated API for Input/Output types, for example. That's just a feeling though. Currently, hash-based change detection isn't easy to track (although Mill evaluator knows exactly which tasks were out of date, see out/profile.json), but it's damn fast and easy to apply.

CircuitCoder commented 1 year ago

I'm trying to get minimal implementations of both approaches tested:

It seems that the CI still needs a few hours to complete. Meanwhile, I'll try to check if these changes allow us to move cache directories.

UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.

lefou commented 1 year ago

UPDATE: a test failed because the repo name is different. Unfortunately, I already have a repo named mill, so cannot fork the repo using that name. I'll try to fork it into another organization and rerun CI.

All build and test jobs should not depend on the repo name. Only jobs, that in some way do release or publishing are bound to the repo name. You can also open a draft PRs, then the appropriate CI jobs will run on Mill repo.

lefou commented 1 year ago

Maybe, it's necessary to have some predefined, named path anchors, from which we can start digesting. workspace might be one of them. coursier-cache could be another one. (In general, I really like the idea of having some easy possibility to decouple the out directory from the project directory. E.g. In autotools/make land, you can build from another location. If we can accomplish something equal in Mill, we can easily build from read-only source trees or put all build output into some RAM-backed storage. Just thinking.) These anchors are probably a bit soft and Mill decides their real path at startup time. It can also be necessary, to ditch the quick optimization of PathRef, which we use for coursier dependencies primarily.

sequencer commented 1 year ago

What I have been dreaming that PathRef can only store the hash of a file, and add an new API to access file system.

lefou commented 1 year ago

We probably need to try to resolve the PathRef against some predefined list of directories, so we can store that virtual name and a sub-path instead. Some alg like that:

// pseudo code

// defined at Mill startup time
val knownLocs: Seq[(PathName, Path)] = Seq(
  "out" -> outPath,
  "workspace" -> T.workspace,
  "coursier-cache" -> cachePath
)

// called at PathRef creation time
def createPathRef(path: Path): PathRef = {
  knownLocs.find { case (name, prefix) => path isSubPathOf prefix } match {
    case Some((name, prefix)) => // create a PathRef with sub-path and virtual base path
    case None => // create a non-portable PathRef with absolute path
  }
}

PathRef comparision is than still based on content hash and the resolved path.

lefou commented 1 year ago

Just some mockup

$ mill show main.jar
"vref:843cb117:out:main/jar.dest/out.jar"
$ mill show main.sources
[
  "vref:2318a653:workspace:main/src"
]
lefou commented 1 year ago

I experimented with this a bit. In my first iteration, I made the PathRef.sig independent of the associated path. I also implemented JSON pickling which takes some context into account, so the actual paths are relative to the specified context. Unfortunately, this alone will not automatically make the Mill cache distributabe, but at least I learned some things.

I will open a PR with the decoupling of PathRef.sig from the path soon, as it is probably a nice feature to have without any disadvantages (I think). It will not contain the context path feature though. (Edit: this is the PR: https://github.com/com-lihaoyi/mill/pull/2106)

In addition to a PathRef, which transports the information of a path and a content signature, we probably also need more types. One for a content tree (the .sig part of PathRef), but which hashCode does not change for different paths. With such a type we could refer to files/directories, where we are only interested in the content and potentially the relative file names. This might be proper input for compilers then. We need a good name though, e.g. TreeRef or ContentRef.

Additionally, we need an type to transport a path disregarding the content. This is almost a thin wrapper around os.Path, but we want to make it context aware. It's essentially a context plus a relative path. This one is needed to refer to actual files, but swap the context depending on the actual runtime location. One usage might be the compiler analysis file of Zinc, which we currently hold in CompileResult.analysisFile. We need a good name too. ContextPath?

The current PathRef could be then constructed from these two, as it depends on the actual path (which should be aware of a context) and the content signature.

To make the Mill cache distributable, we also need to refactor some targets, e.g. a compile target should not depend on targets that return PathRefs but ContentRefs instead. Same for the CompileResult. We probably need to review the whole scalalib architecture under this perspective.

lefou commented 1 year ago

SubPathRef might be a good name (for ContentRef).

lolgab commented 1 year ago

Do exist cases when paths in the .json files in out are not subpaths of T.workspace ? Could we process all the os.Paths and assume them to be relative to T.workspace? Probably not, asking just to validate this idea.

lefou commented 1 year ago

Currently, out is hard-coded, so your assumption is valid. But I'm planning to change that, or at least make it overridable via config or cli, to support more use cases like caching or memory-backed storage or read-only project directories.

lefou commented 1 year ago

@lolgab To your second question: We already store paths that are located outside of the T.workspace, e.g. coursier and ivy artifacts. Making these relative to T.workspace has more potential to cause harm than good.

lefou commented 1 year ago

PathRef.sig is also storing information about permissions. As these have different structure under Windows vs. Unix-based system, we probably won't be able to share between different OSes without further changes.