broccolijs / broccoli-plugin

Base class for Broccoli plugins

Narrowing down caching failure - RFC #11

Closed BryanCrotaz closed 8 years ago

BryanCrotaz commented 8 years ago

See https://github.com/ember-cli/broccoli-caching-writer/issues/70 for background.

I think the bug here might be a bug/design-flaw in broccoli. It appears (from my very much outsider view as a plugin writer) that a new tmp/cache and tmp/output directory is created on each run of broccoli. This isn't in line with the stated aim of the persistentOutput option - what's the point of keeping outputs around for a subsequent run, only to create a new folder and not reuse the previous output?

Would it be better if the tmp/xxx directory name was predictable, e.g. using class name + instance number or annotation, and was only cleared down if the Brocfile was changed? This could be done by storing a hash of the Brocfile in the tmp folder and checking it on startup.

Plugins could now reuse their previous output as per the broccoli design goal.
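
A rough sketch of that startup check, assuming a Brocfile.js at the project root and a hypothetical brocfile-hash file inside the tmp directory:

var fs = require('fs');
var path = require('path');
var crypto = require('crypto');

// Hash the Brocfile and compare against the hash stored by the last run.
// If they differ, tmp should be cleared down before building.
function canReuseTmp(tmpDir) {
  var hashFile = path.join(tmpDir, 'brocfile-hash');
  var current = crypto.createHash('sha1')
    .update(fs.readFileSync('Brocfile.js'))
    .digest('hex');
  var previous = fs.existsSync(hashFile) ? fs.readFileSync(hashFile, 'utf8') : null;
  fs.writeFileSync(hashFile, current); // remember for the next run
  return previous === current;
}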

Thoughts? @joliss @rwjblue @stefanpenner

stefanpenner commented 8 years ago

This could be done by storing a hash of the Brocfile in the tmp folder and checking it on startup.

this hash is much more complicated than just a hash of the Brocfile.

stefanpenner commented 8 years ago

directory is created on each run of broccoli

each run, or each rebuild? We expect persistentOutput to only live for the duration of a process's life, not beyond.

BryanCrotaz commented 8 years ago

so how does caching work if you run the process twice?

BryanCrotaz commented 8 years ago

for example, when you want to do regular testing and then a deploy - our build takes around 6 minutes

stefanpenner commented 8 years ago

so how does caching work if you run the process twice?

The caching you see is for rebuilds (in the same process), not builds between different processes. The latter is quite tricky to get correct. Remember, cache invalidation is one of those gnarly problems.

For some plugins, like broccoli-persistent-filter, the inputs/output/options/dependencies are mostly known, so a more aggressive inter-process persistent cache can be created. It is, though, not without its own caveats (several pending issues have yet to be resolved, and great care was originally taken).

In many cases, you should be able to do something like:

ember build --output-path somewhere --env production  # build it once
ember test --path somewhere  # if I recall, this or something like it allows an existing build to be used
# and if ember deploy is a well-behaved citizen, maybe it supports:
ember deploy --path somewhere

BryanCrotaz commented 8 years ago

Trouble is, I can create a persistent cache for archiver, but unless the incoming merge, funnel, replace etc. also use one, the input files are going to change every time. And hashing every file is likely to be just as slow...

stefanpenner commented 8 years ago

And hashing every file is likely to be just as slow...

you should check - hashing is really fast (largely just limited by available IO).

What you describe is a trade-off between stale reads and accurate reads. We err on the side of accurate reads, because the alternative leads to very bad things. Continuing to improve and evolve the caching + perf story is great, but we cannot sacrifice build accuracy or we will just cause much more grief.

An alternative to explore is what I described above (build once, and reuse for multiple commands).
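
For scale, a content hash in Node is a single streaming pass per file, bounded mostly by disk IO rather than the hashing itself; a minimal sketch (the sha1 choice is arbitrary):

var fs = require('fs');
var crypto = require('crypto');

// Stream a file through a hash; throughput is limited by IO,
// not by the digest computation.
function hashFile(filePath, callback) {
  var hash = crypto.createHash('sha1');
  fs.createReadStream(filePath)
    .on('data', function (chunk) { hash.update(chunk); })
    .on('end', function () { callback(null, hash.digest('hex')); })
    .on('error', callback);
}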

BryanCrotaz commented 8 years ago

no, obviously build accuracy has to come first, and cache invalidation should err on the side of pessimism. I'll investigate hashing and persistent caching.

stefanpenner commented 8 years ago

@BryanCrotaz being able to build once, and then reuse for testing + deploying also seems like a big win (if it is not already possible).

BryanCrotaz commented 8 years ago

But even then, you'd come back the next day and be hit by a 6 minute startup

stefanpenner commented 8 years ago

@BryanCrotaz I also suspect some of the slower steps, like uglification, may benefit from utilizing persistent-filter, as I suspect (for example) that your app's vendor file does not change very often.

BryanCrotaz commented 8 years ago

and you'd need to leave your build process running so as to get in-process caching, so you're in danger of kicking off another build while deploy is still in progress

stefanpenner commented 8 years ago

and you'd need to leave your build process running so as to get in-process caching, so you're in danger of kicking off another build while deploy is still in progress

builds don't mix; what's the danger?

stefanpenner commented 8 years ago

But even then, you'd come back the next day and be hit by a 6 minute startup

development warm builds should not take 6 minutes.

Our largest app is about 500kloc of app code (not addons not vendor), and the warm dev (non-prod) boots are 12s.

BryanCrotaz commented 8 years ago

to get caching, you're (I think) suggesting running two instances of broccoli with different output directories

stefanpenner commented 8 years ago

@BryanCrotaz no

BryanCrotaz commented 8 years ago

development warm builds should not take 6 minutes.

I bet you're not zipping multiple copies of your resulting node_modules and injecting binaries like ffmpeg into them?

BryanCrotaz commented 8 years ago

@BryanCrotaz no

Then you're surely going to get nothing better than what we have at the moment - a build taking 6 minutes before a test or deploy

stefanpenner commented 8 years ago

@BryanCrotaz it sounds like your custom steps require optimization. Most likely, following the approach taken by persistent-filter will lead you down a path of success.

stefanpenner commented 8 years ago

@BryanCrotaz no

Then you're surely going to get nothing better than what we have at the moment - a build taking 6 minutes before a test or deploy

@BryanCrotaz no, I believe you are confused. The idea is, build once, then subsequent commands should use the already built output (not build their own) to perform tests + deploy etc.

BryanCrotaz commented 8 years ago

The idea is, build once, then subsequent commands should use the already built output to perform tests + deploy etc.

That's what we're doing - but if tests show a bug (though of course we never have bugs!) then you have a 6 minute cycle to try again.

BryanCrotaz commented 8 years ago

it sounds like your custom steps require optimization

We build about 15 zip files - in 99% of cases only one of them includes new source changes, so if we could cache we'd only need to rebuild one of them

BryanCrotaz commented 8 years ago

Our custom steps are just concatenating JSON, e.g. building AWS CloudFormation templates from lots of small files. Again, these change rarely, so caching would fix this problem.
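
A rough sketch of the input-hash check that would let the unchanged zips be skipped (the cache shape and the buildZip callback here are hypothetical):

var fs = require('fs');
var crypto = require('crypto');

// Hash the contents of a zip's input files; rebuild the archive only
// when that hash changes. `cache` maps archive name -> last input hash,
// and `buildZip` is the caller's existing archiver step.
function buildZipIfChanged(name, inputFiles, cache, buildZip) {
  var hash = crypto.createHash('sha1');
  inputFiles.forEach(function (file) {
    hash.update(fs.readFileSync(file));
  });
  var current = hash.digest('hex');
  if (cache[name] === current) {
    return; // inputs unchanged; keep the previous archive
  }
  cache[name] = current;
  buildZip(name, inputFiles);
}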

stefanpenner commented 8 years ago

@BryanCrotaz i was referring to:

I bet you're not zipping multiple copies of your resulting node_modules and injecting binaries like ffmpeg into them?

stefanpenner commented 8 years ago

That's what we're doing - but if tests show a bug (though of course we never have bugs!) then you have a 6 minute cycle to try again.

It seems like:

  1. ember s
  2. check on /tests while you develop (for fast iteration), or run ember test --server during development

would catch your test failures nice and quick.

Starting and stopping the process to run tests seems very strange.

BryanCrotaz commented 8 years ago

Can't avoid that bit - we're building for AWS Lambda, so it requires a zip of the executable node files. I've built an npm-pull plugin which only brings in the modules we're actually using, but it's still a lot of files.

BryanCrotaz commented 8 years ago

That would be great if it were an ember app - this is a broccoli problem, not an ember problem!

stefanpenner commented 8 years ago

this is a broccoli problem,

I'm dubious of this.

Can't avoid that bit - we're building for AWS Lambda, so it requires a zip of the executable node files. I've built an npm-pull plugin which only brings in the modules we're actually using, but it's still a lot of files.

Have you tried zipmerge? Seems like you could zip your node_modules, then append the zip of your app code.
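
For illustration, zipmerge (shipped with libzip) merges source archives into a target; driven from a Node build step it might look like this (the archive names are placeholders, and the exact flags should be checked against the zipmerge man page):

var execFile = require('child_process').execFile;

// Merge a pre-built node_modules archive with a freshly zipped app:
// zipmerge appends the entries of the source archives into the target.
execFile('zipmerge', ['lambda.zip', 'node_modules.zip', 'app.zip'], function (err) {
  if (err) { throw err; }
  console.log('lambda.zip updated');
});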

BryanCrotaz commented 8 years ago

Seems like you could zip your node_modules, then append the zip of your app code

Surely this has the same caching problem?

BryanCrotaz commented 8 years ago

it would be a great way to speed it up once persistent caching is working

stefanpenner commented 8 years ago

@BryanCrotaz ya, but while you're working on your app you can manually be aware of whether you have made a change to your node_modules or not. Broccoli cannot infer this without extra work.

BryanCrotaz commented 8 years ago

Can't broccoli treat node_modules as any other source folder?

stefanpenner commented 8 years ago

Can't broccoli treat node_modules as any other source folder?

It does

BryanCrotaz commented 8 years ago

then why can't broccoli infer this?

stefanpenner commented 8 years ago

then why can't broccoli infer this?

It can, it just must diff the whole fs tree - which is why I said "extra work".

stefanpenner commented 8 years ago

Well, to be accurate, it also would need to be aware of the Node resolution algorithm and walk in both directions (up and down), looking for any ancestor node_modules and also traversing them.
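
For illustration, the upward half of that walk looks roughly like this (a sketch, not what broccoli actually does):

var fs = require('fs');
var path = require('path');

// Walk upward from a starting directory, collecting every ancestor
// node_modules directory, the way Node's resolution algorithm does.
function ancestorNodeModules(startDir) {
  var found = [];
  var current = path.resolve(startDir);
  for (;;) {
    var candidate = path.join(current, 'node_modules');
    if (fs.existsSync(candidate)) {
      found.push(candidate);
    }
    var parent = path.dirname(current);
    if (parent === current) {
      break; // reached the filesystem root
    }
    current = parent;
  }
  return found;
}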

BryanCrotaz commented 8 years ago

So basically, to get persistent caching we need a fast way to diff a requires tree. Early exit on the first diff found would be fine (pessimistic).

BryanCrotaz commented 8 years ago

and using file hashing rather than last-modified times, so that other plugins that copy files wouldn't break downstream caches
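
A minimal sketch of that pessimistic, hash-based diff (the manifest shape and the hashOf helper are hypothetical):

// Compare each file's current content hash against a stored manifest
// (file -> hash). Returns on the first difference found, so a changed
// tree is detected without hashing everything.
function treeChanged(files, manifest, hashOf) {
  for (var i = 0; i < files.length; i++) {
    if (manifest[files[i]] !== hashOf(files[i])) {
      return true; // early exit: one diff is enough to invalidate
    }
  }
  return false;
}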

BryanCrotaz commented 8 years ago

being clever with symlink recognition would hopefully shortcut some steps

stefanpenner commented 8 years ago

@BryanCrotaz well, it's more than just that: all inputs must be part of the cache. This is quite tricky, and also quite costly to perform.

I really think you should focus on maintaining the zip of your deps manually (just add a postinstall hook or something to invalidate it), and using zipmerge as part of the build.
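
A minimal sketch of that invalidation hook, assuming the cached deps zip lives at a known path (both the script name and the path are placeholders):

// scripts/invalidate-deps-zip.js (hypothetical), wired up in package.json as:
//   "scripts": { "postinstall": "node scripts/invalidate-deps-zip.js" }
var fs = require('fs');

var cachedZip = 'tmp/node_modules.zip'; // placeholder path to the cached archive
if (fs.existsSync(cachedZip)) {
  fs.unlinkSync(cachedZip); // force the next build to re-zip dependencies
}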

stefanpenner commented 8 years ago

other plugins that copy files

do plugins still do this? Our builds are without such steps.

BryanCrotaz commented 8 years ago

can you add a global postinstall hook? you'd need it to fire on every install

BryanCrotaz commented 8 years ago

do plugins still do this? Our builds are without such steps.

Yes - every time you rerun broccoli

stefanpenner commented 8 years ago

can you add a global postinstall hook? you'd need it to fire on every install

Sounds like something you should investigate. I am merely hinting at a simple solution.

stefanpenner commented 8 years ago

and using file hashing rather than last-modified times, so that other plugins that copy files wouldn't break downstream caches

this is how broccoli-persistent-filter works.

BryanCrotaz commented 8 years ago

so if it works for broccoli-persistent-filter, why can't it work for broccoli-caching-writer?

stefanpenner commented 8 years ago

so if it works for broccoli-persistent-filter, why can't it work for broccoli-caching-writer?

it could, but not in the general case, as input/output is typically entirely opaque for caching-writer. Whereas with broccoli-filter the input is known, as the filter is responsible for reading the input and passing it to processString.
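
For concreteness, a filter subclass looks roughly like this (a sketch against broccoli-persistent-filter's subclassing API; check its README for the exact persistence hooks):

var Filter = require('broccoli-persistent-filter');

// A trivial filter: the base class reads each input file and hands its
// contents to processString, so the input to every output is known.
function Uppercase(inputNode, options) {
  Filter.call(this, inputNode, options);
}
Uppercase.prototype = Object.create(Filter.prototype);
Uppercase.prototype.constructor = Uppercase;

Uppercase.prototype.processString = function (contents, relativePath) {
  return contents.toUpperCase();
};

// Opting in to the inter-process persistent cache additionally requires
// a stable base directory for the cache key.
Uppercase.prototype.baseDir = function () {
  return __dirname;
};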

BryanCrotaz commented 8 years ago

I really think you should focus on maintaining the zip of your deps manually (just add a postinstall hook or something to invalidate it), and using zipmerge as part of the build.

npm does have global scripts; however, the dependencies are generated by walking the requires tree, so the deps are dependent on the source. Darn.

stefanpenner commented 8 years ago

npm does have global scripts; however, the dependencies are generated by walking the requires tree, so the deps are dependent on the source. Darn.

Right

you should read the source for persistent-filter, as we have sorted out most of that issue.

BryanCrotaz commented 8 years ago

Is there scope here for a clone of persistent-filter which takes in a tree and outputs a new tree, rather than a 1:1 map? The whole output would be recalculated if any of the inputs have changed.
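
A rough sketch of that shape, using broccoli-plugin's Plugin base class with persistentOutput and a naive whole-tree hash (hashTree here is a stand-in for a real hashing strategy, and the options.build callback is hypothetical):

var Plugin = require('broccoli-plugin');
var fs = require('fs');
var path = require('path');
var crypto = require('crypto');

// Naive whole-tree hash: every path name and file body feeds one digest.
function hashTree(paths) {
  var hash = crypto.createHash('sha1');
  paths.forEach(function walk(p) {
    hash.update(p);
    if (fs.statSync(p).isDirectory()) {
      fs.readdirSync(p).sort().forEach(function (entry) {
        walk(path.join(p, entry));
      });
    } else {
      hash.update(fs.readFileSync(p));
    }
  });
  return hash.digest('hex');
}

function CachedTreeWriter(inputNodes, options) {
  Plugin.call(this, inputNodes, { persistentOutput: true });
  this._lastHash = null;
  this._buildTree = options.build; // user-supplied whole-tree build step
}
CachedTreeWriter.prototype = Object.create(Plugin.prototype);
CachedTreeWriter.prototype.constructor = CachedTreeWriter;

CachedTreeWriter.prototype.build = function () {
  var currentHash = hashTree(this.inputPaths);
  if (currentHash === this._lastHash) {
    return; // inputs unchanged; persistentOutput keeps the previous output
  }
  this._lastHash = currentHash;
  return this._buildTree(this.inputPaths, this.outputPath);
};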