Closed oehme closed 7 years ago
The input is the source code, and the output is also the exact same source code. Can gradle handle this case?
I'm only talking about the check task. The update task cannot be made incremental, for exactly the reason you said.
My largest project has 6.87MB of java source, spread across 1636 files (in many subprojects, of course :). It takes 5.6 seconds to sum their filesizes with this code in my buildscript:
println 'numFiles = ' + files.size()
long start = System.currentTimeMillis()
long totalSize = 0;
for (File f : files) {
totalSize += f.length()
}
long elapsed = System.currentTimeMillis() - start
println 'filesize_MB=' + totalSize / 1_000_000
println 'elapsed_ms=' + elapsed
Running spotlessCheck
on these files with these rules:
spotless {
java {
licenseHeaderFile
importOrderFile
eclipseFormatFile
custom 'AutoValue fix', { it.replace('autovalue.shaded.', '') }
}
}
Takes 38 seconds with a fresh daemon, 16 seconds on the second run, and levels out around 14s.
Given that it takes 6 seconds just to sum their filesizes, I think adding incremental support can, at best, take ~8-10 seconds off of a very large check task, which is about half. It'll be much much faster than any reasonable set of unit tests on a codebase this size.
@oehme How does gradle determine up-to-date? Does it trust filesize+lastModified, some kind of inotify, or take a hash?
The hardest part will be serializing custom input functions. In the example above, if the user changes the AutoValue fix
function, we want to be sure to recalculate the "clean" state. If we don't have a good way to do that, then we introduce a minor staleness bug. If spotlessCheck is very slow, and the up-to-date checking is very fast, then it would be worthwhile. But unless gradle's up-to-date checking is doing something very clever like hooking into filesystem events, it seems to me that it's not worth the complexity.
That code you showed is too simplistic. Gradle can up-to-date check projects with a 100000 source files in about 5s. We have performance tests for that. Also, in the future Gradle will even do this pre-emptively (through file system listeners), so that it already knows the up-to-date state before you even run the next build.
The problem is that the spotless check task will run even if the source code for a project did not change. Imagine you only changed one of your projects, but spotless still insists on re-running on every project. That's a big waste of time. There really is no good reason not to do up-to-date checking ;)
You can hash custom functions by hashing their byte code (which you can get from their ClassLoader).
I just saw that spotless adds a task to the root project which checks the complete source tree. This is so much unnecessary work on incremental builds. That calls for a redesign towards much more granular tasks. The plugin should add a task for every sourceSet. If the user wants to check stuff outside of the sourceSets, he can add another check task that only checks some specific files (let's say, the build.gradle files in his gradle
folder).
Basically the extension defines how to check (including the file extension, but not the full path). The task decides what to check.
spotless adds a task to the root project
It doesn't add a task to the root project, just the project. You can add it to the root project (I often do), or you can add it to each subproject independently if you prefer.
and then checks the complete source tree
It is true that, by default, it creates one spotlessJavaCheck
task for all of a single project's sourceSets, rather than creating a task per sourceSet. That's because most people want to have the same format for all of their sourceSets. If the user wants separate formatting rules for different sourceSets, that's easy to do too, but I don't see a reason to make that the default behavior.
I don't understand why one task with 500 files would be faster than two tasks with 250 files each, with or without incremental build support, so I don't understand why this causes unnecessary work. But if it does, and we need to redesign, I'm open to that :)
The problem is that the spotless check task will run even if the source code for a project did not change. Imagine you only changed one of your projects, but spotless still insists on re-running on every project.
Imagine you want to calcualte the sum of 1000 numbers, and rather than caching each incremental sum, you just recalculate the whole sum when any number changes. Supporting incremental processing is only a win if the processing is more expensive than tracking the changes. Spotless is very simple, and so it is very fast - the bottleneck is disk IO.
Until we can measure that something is a bottleneck, and we can measure that we have sped it up, I don't think we can justify adding complexity. I'm open to the idea that my benchmark is naive, but I don't see how yet. My machine is an old laptop with a magnetic disk, and it takes 6 seconds to look at the metadata of 1600 files, and 14 seconds to read them from disk, format them, and check that they are clean. If your build server has an SSD that can read the metadata for 100,000 files in 5 seconds, maybe it's also powerful enough to format them in 10 seconds?
You can hash custom functions by hashing their byte code
That's a clever idea, I like it! When gradle implements filesystem listeners, I could maaaaybe see that it might be worth the complexity of implementing this feature. But for now, I don't understand why users would experience a faster build, because I don't understand how gradle can check for up-to-date in less time than my naive benchmark.
That's because most people want to have the same format for all of their sourceSets.
Sure, the format would be specified in the extension, which all tasks reuse. But each sourceSet should have its own task. When I only change my test code, I don't want to re-check the main code. It's a waste of time.
I don't understand why one task with 500 files would be faster than two tasks with 250 files each, with or without incremental build support
You are thinking way too much in terms of full builds. Gradle is all about incremental builds. When I change a single source file in a single project, especially with continuous build, I expect a response in a few hundred milliseconds, not several seconds or more.
Imagine you want to calcualte the sum of 1000 numbers, and rather than caching each incremental sum, you just recalculate the whole sum when any number changes. Supporting incremental processing is only a win if the processing is more expensive than tracking the changes. Spotless is very simple, and so it is very fast - the bottleneck is disk IO.
I won't get into an argument here, just test it yourself and you'll be suprised how much faster up-to-date checking is than you think
task foo() {
inputs.files fileTree('someLargeDirectory')
outputs.file file('dummy')
doLast {
//do nothing, just seeing how fast up-to-date checks are
}
}
I did that on a 256MB tree containing 100000 source files. ./gradlew foo
takes 1.3s.
Until we can measure that something is a bottleneck, and we can measure that we have sped it up, I don't think we can justify adding complexity.
As I showed you this is trivial to measure. Also, you are talking as if adding incremental build support was some monumental task. For a simple plugin like this, it's a weekend project at best.
Running spotless on that same filetree takes 13s when I change a single file. That's 10 times longer! Imagine how much faster it would be if it only checked that single file. Gradle makes that trivial.
The config was very simplistic, so this gets worse as you add more rules:
spotless {
format 'native', {
target '**/*.cpp', '**/*.c', '**/*.h'
customReplace 'Not enough space after if', 'if(', 'if ('
customReplaceRegex 'Too much space after if', 'if +\\(', 'if ('
}
}
Imagine how much faster it would be if it only checked that single file
We don't have to imagine, we can calculate it ;-) It would be ~12s seconds faster on a 256MB source tree. On a 26MB source tree it would presumably be ~1.2 seconds faster. The biggest source tree I use in my humble world is just 7MB, though my laptop seems to be a lot crummier than yours ;-)
If spotless is consuming 10% of a build's duration, then a 10x speedup in spotless is a 9% build speedup.
I'd be happy to merge a PR. To me, this data confirms that spotless isn't slow enough for this optimization to make it to the top of my weekend project priority queue, at least not on a "utility" basis. But it would be a fun project to learn about or demonstrate Gradle's incremental APIs! If anyone is interested in doing this project for that purpose, go for it!
After thinking about the custom function issue, it doesn't bother me that much anymore - it's fine if changing a custom function doesn't cause spotlessCheck
to recalculate. Changes to these functions will almost certainly be tested using spotlessApply
, so it's okay if spotlessCheck
is broken in this narrow way.
I corrected the number above. It is 10s slower or in other words 10 times as slow.
And remember this is for a trivial set of replacements. I'm sure it's way worse if you add more. I'll give you some more numbers later.
Currently, this is the only task in the project, FormatTask. If anybody has time and need to make the change, I'm happy to merge and release the PR. Clone this repo, run gradlew ide
, and you've got a dev environment. Here's the gradle docs on incremental builds.
I'd guess that making this change might require creating a separate FormatApplyTask
and FormatCheckTask
task, but if there's a way to avoid that it would be better to be able to release as 2.1.0
rather than 3.0.0
, but whatever the code needs :)
I will be happy to assist the contributor with this. It will definitely require a 3.0 to do this right.
We should fix some other non-idiomatic Gradle usages while doing this. For instance the formats DSL and how the plugin creates tasks from it is not very user friendly. There are lots of special characters needed like quotes, commas and parentheses. It just looks very different from other Gradle APIs. Also, the task is not available except using afterEvaluate
which makes customization look awkward. This can easily be fixed by following the same patterns that all Gradle core plugins use for their DSL. And if we're breaking API to fix the tasks, we should take this into account too.
We should track that in its own issue, but assign it to the same 3.0 milestone. @nedtwigg can you set that up? :)
Here's my design for a Gradle-idiomatic DSL for spotless. The plugin would create a pair of check/apply tasks for every 'target'. This will make the plugin faster, more flexible and easier to extend.
spotless {
//formats specify how to format certain kinds of code
formats {
// by using the named container API you get stuff like 'all' for free :)
all {
lineEndings 'GIT_ATTRIBUTES'
}
//java is a special method that returns a more specific type
java {
lineEndings 'UNIX'
}
//so is freshmark :)
freshmark {
}
//custom formats look just like pre-defined ones, no Strings or commas
misc {
//the format only specifies the file extension it cares about
//which filetrees to process is decided on the target level
extensions '.gradle', '.md', '.gitignore'
//use boolean assignment instead of no-arg methods to
//avoid as many parentheses as possible and also making it
//easy to turn off again
trimTrailingWhitespace true
indentWith '\t'
endWithNewline true
step('mySpecialStep') {
//imperative code here
}
}
}
//targets select file trees to scan and optionally which formats to apply
targets {
buildFiles {
dir(projectDir) //careful here, we don't want to recurse through the whole project ;)
fileTree("${projectDir}/gradle")
//optional, default is all formats
formats formats.misc
}
//when the java plugin is applied, it automatically adds a target
//for every sourceSet of the project
}
}
(Unrelated aside: setting the lineEndings
property is almost always a bad idea now that spotless supports gitattributes, see here for explanation)
I've created issue #32 to discuss your proposed DSL change.
We can definitely implement up-to-date checking and incremental support without changing the DSL, they're almost unrelated. Even if we decide to break the existing DSL, we should first release a version which adds incremental support for people who can't immediately make the jump to 3.0.0
.
Note to potential contributors: I will quickly merge and release any PR's which add up-to-date support or incremental build, but please don't break the DSL.
We will need to break up the tasks into two different types, so that will be a breaking change. But yes, we don't need to change the spotless
extension DSL to get incremental builds.
Breaking up the DSL to create more fine grained tasks will make incremental builds even better though, that's why I mentioned it here.
I'd be interested in looking into this.
As I'm new to writing Gradle plugins, I'd have to study the incremental tasks docs and delve into the Spotless source code to wrap my head around things, so I'll let you know both know if I manage to produce something useful that could act as a good starting point for this.
@jbduncan Cool! Note that the plugin currently scans the whole project dir, which includes stuff like the Gradle cache directory and the build directory. So you'll need to to apply at least some basic filters to its inputs before up-to-date checking will kick in.
Hi @oehme, many thanks for your response! It's not clear to me how I should go about implementing these filters you suggested, so I wonder if you could point me towards some resources which I could use to learn more about them?
Have a look at the user guide section on file trees :)
Thanks @oehme!
I've made an initial attempt at this in https://github.com/diffplug/spotless/pull/45, but there's a few things which I'm confused about, and I wonder if you can help me.
Firstly, I don't really know if I've understood the docs on incremental tasks, and so I don't know if I've used the annotations and IncrementalTaskInputs
class correctly, and I wondered if you could give me some constructive feedback.
Secondly, the tests no longer compile, but it's not clear to me what I should do to make them compile again (something about needing IncrementalTaskInputs
parameters, but I don't know if I should mock them or if there are proper test impls that I can use), so I wondered if you can point me in the right direction.
And finally, it's not clear to me how and where I should use getProject().fileTree(...)
in the code to trigger up-to-date checking, and I wondered if you could point me in the right direction with this too.
Firstly, I don't really know if I've understood the docs on incremental tasks, and so I don't know if I've used the annotations and IncrementalTaskInputs class correctly, and I wondered if you could give me some constructive feedback.
I'll leave a few comments in the PR. Generally, the annotations are the first step, so the task can be up-to-date checked. Using IncrementalTaskInputs
is the icing on the cake :) Then the task only processes the inputs that actually changed.
Secondly, the tests no longer compile, but it's not clear to me what I should do to make them compile again (something about needing IncrementalTaskInputs parameters, but I don't know if I should mock them or if there are proper test impls that I can use), so I wondered if you can point me in the right direction.
Generally I don't unit-test task classes, because they rely on quite a bit of Gradle infrastructure. Instead, I extract the processing logic into a separate class and unit-test that one. If you do want to unit-test them, then mocking the IncrementalTaskInputs
is probably the way to go.
And finally, it's not clear to me how and where I should use getProject().fileTree(...) in the code to trigger up-to-date checking, and I wondered if you could point me in the right direction with this too.
I might have been a little unclear here. It does not trigger up-to-date checking. It's just that checking the whole project dir without a filter would include constantly changing things like the Gradle cache and the build directory.
The problematic code is where the spotless plugin sets up the targets. This should have some default excludes (cache dir, build dir).
Generally I don't unit-test task classes, because they rely on quite a bit of Gradle infrastructure. Instead, I extract the processing logic into a separate class and unit-test that one. If you do want to unit-test them, then mocking the IncrementalTaskInputs is probably the way to go.
It's not clear to me yet if I do want to unit-test the task classes, but it seems like the path of least resistance ATM, so I'll go down that route for now, and if I see a way of improving things later, I'll improve them then. :)
Thanks again for your feedback @oehme. I've realised I'm now stuck again, as one of the tests is failing and I'm struggling to see how to make things go green again, and I wonder if you or @nedtwigg can help me move forward.
I pushed my most recent code snapshot to https://github.com/diffplug/spotless/pull/45/commits/6061363bc5e3053c28ab883f67f48631aee95c0a.
I grabbed your code and responded in the PR.
Hmm, I'm feeling rather lost now. I understand there's been further discussion at https://github.com/diffplug/spotless/pull/45 and https://github.com/diffplug/spotless/issues/47, but I've lost track of how it all fits together, and there's even some parts which my mind is just refusing to parse. Consequently I've not understood what it is I'm supposed to be doing now to move forward with https://github.com/diffplug/spotless/pull/45.
@nedtwigg @oehme Would one of you kindly give me a summary of what's been discussed and explain what I should do next? :)
Just pulled your latest, tests pass, great work!
The core wrinkle is that all FormatterSteps need to support Serializable/equals/hashCode, and we don't have a great way to do that for custom steps. We talked this through here and @oehme convinced me that his way was best. #47 is the one open wrinkle, we'll have to figure out what we think about it to completely close it out.
Also, I just realized another hard part - LineEnding.Policy. By default, we use GitAttributes, which depends on every .gitattributes file in the repo, as well as system-wide config files. So we've gotta make LineEnding.Policy be Serializable, and figure out how to implement that for GitAttributes.
I made some unrelated changes causing a conflict - I'm merging master into your branch and I'll upload the change and come back again with a to-do list we can work through together.
Just uploaded the merge and some minor cleanup. We need to make all of the following things properly serializable/equalsTo/hashCode. Right now a lot of these are lazily evaluated, and a naive serializable implementation will make it more eager than is necessary. Maybe that's unavoidable, depending on how gradle handles the @Input
annotations - we'll pick @oehme's brain as we go ;-)
Depending on what we decide on #47, we can either ignore these completely, or at least handle them piece-by-piece. We'll need some infrastructure for making the serialization easy, and for easily testing that the serialization is working. I've got some ideas for this, but I gotta run. I'll upload these ideas later today, feel free to upload other ideas too :)
Based on how hard it is to actually do this, maybe we'll consider bailing on this approach and using the full shortcut presented by #47.
I've uploaded a few commits. They have long commit messages that tell the story, but here's the outline:
Based on this, I think that making every built-in rule properly support up-to-date checking is fairly straightforward (though it's definitely a lot of work!).
If it looks like a generally good idea to you guys, then I think we should merge this to a new branch 3.x
and start implementing all these things as their own PRs. It also desperately needs some integration testing to ensure that the incremental builds are running and not running as we expect.
You can use TestKit to verify that your tasks are executed/skipped as expected.
Gradle evaluates inputs just before a task is executed, so no worries about eagerness.
Also don't sweat about the Serializability. We don't need cross machine/cross version compatibility. This is just about local caching and if serialization breaks with a new version then Gradle will just reexecute the task.
Literally just slapping implements Serializable
on those classes should work in almost any case.
I don't think this separation of the state and the implementation is necessary. I think it just makes it more complex. What's the gain?
@nedtwigg I wonder if the bumpThisNumberIfACustomRuleChanges()
option you created in https://github.com/diffplug/spotless/pull/45/commits/eefdc50f22143a970118c2a5b932aabc035e1b3b is really needed. If, as @oehme said, we could just slap implements Serializable
and reasonable impls of equals/hashCode on all the relevant classes, wouldn't Gradle be able to figure the rest out? Or have I misunderstood something?
@oehme You mentioned in https://github.com/diffplug/spotless/issues/47#issuecomment-257519012 that one can write "rules as standalone classes". What do you mean by that? Are there any resources we can refer to to learn more about this concept?
@jbduncan every spotless rule is just an implementation of some rule interface. It is a functional interface, so you can write it inline using the lambda syntax in Groovy. But then it will never be up-to-date, because equals/hashCode will use reference equality.
To allow proper up-to-date checking, you would just write it as a named class instead
@groovy.transform.Canonical //takes care of equals/hashcode
MySpecialRule implements FormatterRule { //FormatterRule extend Serializable
int someConfigValue
String format(String input) {
//whatever
}
}
and then in your spotless config
custom 'mySpecialFix', new MySpecialRule(42)
Thanks @oehme, your description almost makes perfect sense to me.
What I'm still unsure about is whether FormatterRule
is a type provided by Gradle or if it's something which we'd have to provide in Spotless. Could you clarify this for me?
That would be a type provided by spotless. Currently it uses Throwing.Function
, which is not Serializable
Ah, I see! Thanks. :)
I don't think this separation of the state and the implementation is necessary. I think it just makes it more complex. What's the gain?
Transparent lazy init. Each Format
consists of an eagerly-constructed list of FormatterStep
s (FormatterRule in your example above). To ensure that Spotless doesn't cause any file IO if its tasks aren't being used, most FormatterStep
are lazy: customLazy(() -> new LicenseHeaderStep(licenseFile)::format)
. This means that all the work of initializing a step doesn't need to happen until / if it's actually used.
Looking at the 51f0531 implementation of LicenseHeaderStep, you can see that in the case where a file needs to be parsed, that state is loaded lazily. It's not a big deal to load a teensy license header file. But in the case of google-java-format
, for example, people are using that plugin to get -SNAPSHOT versions from maven. So to figure out the state, we actually have to grab the latest POM and check it against the last POM we grabbed. Finding the state of all the .gitattributes
files in the repo is also fairly expensive.
I wonder if the bumpThisNumberIfACustomRuleChanges()
The only purpose of this method is so that simple rules created in this way:
custom 'quickRule', { quickExpression }
don't become second class citizens. This flexibility is Spotless' greatest strength imo - it's the reason why it was so easy to absorb FreshMark, GoogleJavaFormat, etc. No reason to require it for built-in rules, but it would be great if simple rules don't degrade performance (relative to a build which doesn't have simple rules).
Okay, thanks for the explanation @nedtwigg.
It's not clear to me at this stage which of the two options - bumpThisNumberIfACustomRuleChanges()
, or creating a named formatter action class a-la https://github.com/diffplug/spotless/issues/31#issuecomment-257549053 - should be included in Spotless, or if both should be included... so unless you have any objections @nedtwigg, I'll get started with the check boxes in https://github.com/diffplug/spotless/issues/31#issuecomment-257445295, and see where that takes us. :)
Just added fb613bc which tweaks LazyForwardingEquality to implement equals and hashCode in terms of the serialized bytes, so no need for manual imp.
Feel free to dig in to the checkboxes, however you see fit! Implement FormatterStep directly, or use the FormatterStep.Strict stuff. If it turns out to be easier to hand-roll lazy init/equals/hashCode for each class then I'm fine with axing it :)
I'm going to ship 2.4.0
which has been sitting as snapshot for a couple days, then create a new branch 3.x, and switch travis to use the 3.x branch for putting binaries into the maven snapshot repo. That way users can start testing on real projects.
Sounds good to me!
Oh sorry @nedtwigg, I've realised that I'm actually going to be rather busy with things, and I don't really know how long I'll be busy for, so I won't might not be able to work on the checkboxes in the foreseeable future. Therefore I'd be happy if you or someone else were to have a go at them instead.
No worries! Anybody interested can work through the check boxes at whatever pace they've got time for :)
Ah, I actually had a bit of time this afternoon, so I managed to do some cleanup and make GoogleJavaFormat
incremental-build-ready in branch 3.x via commits https://github.com/diffplug/spotless/commit/7ee199fb9a389e2513971887ffb1cb468d8f82b8, https://github.com/diffplug/spotless/commit/044be5f93f493de35ab48da8cea952e4ee32f51e and https://github.com/diffplug/spotless/commit/3165b83408c19cba2b508410c928696249c4953b.
Also, FYI @nedtwigg, at some point we'll need to change CheckFormatTask.formatCheck(Formatter)
so that it also accepts a parameter of type IncrementalTaskInputs
, so that Gradle can better do incremental builds.
This parameter would be used in place of BaseFormatTask.target
, so for example, given this snippet at https://github.com/diffplug/spotless/blob/3.x/src/main/java/com/diffplug/gradle/spotless/CheckFormatTask.java#L34-L47
private void formatCheck(Formatter formatter) throws IOException {
...
for (File file : target) {
...
}
we'd need to change it to the following, by my understanding:
private void formatCheck(Formatter formatter, IncrementalTaskInputs inputs) throws IOException {
...
inputs.outOfDate(details -> {
File file = details.getFile();
...
});
Roger. Also, see my comments on 3165b83.
I have a bit of time this evening, so I'm happy to continue having a go at this.
I've pushed commits https://github.com/diffplug/spotless/commit/2f984b685993c1c17801a1ebe2a15ee06ae26a5d and https://github.com/diffplug/spotless/commit/aa2cab5bfd7935647b8172d05f462be25df7cfed, which have moved things forward with EclipseFormatterStep. @nedtwigg, I wonder if you'd kindly review them for me?
I had a go at GitAttributesLineEndingPolicy, but it's not clear to me how to serialize it yet, as not all its fields are serializable yet, and two of those fields (List<AttributesRule>
s) use classes from JGit which we do not control.
@nedtwigg @oehme I wonder if I have your thoughts, if any, on how we could make GitAttributesLineEndingPolicy serializable?
The general problem with all these classes is that they have eagerly populated fields and then laziness is wrapped around them from the outside. Serialization would be trivial if the laziness was built into the classes instead. E.g. GitAttributesLineEndingPolicy should have a transient
field for the attribute rules which it reads from the file system when it is requested for the first time.
I think @oehme raises a rather good point.
@nedtwigg I think we might be able to lower the number of lines of code and make the codebase easier to learn if we make GoogleJavaFormat, EclipseFormatterStep etc. implement FormatterStep, do things lazily internally and be serializable directly.
If I have time, I'd be happy to have a go at converting the classes I've touched so far. Do either of you have any objections to this?
No objections here, but remember that they'll need to also implement equals and hashCode. If you look at frameworks like React.js, the key innovation is to make things immutable, and make the framework figure out the hard parts. It looks to me that getting rid of the FormatterStep.Strict takes the hard part out of the framework, and sprinkles it into every impl. But I might be wrong, and I'm happy to look at some code for other ideas.
I think the best next step is to get incremental working on at least one rule, and take a look at what our performance gains are.
Ah okay, you make a pretty convincing case, so I agree now that getting rid of the FormatterStep.Strict
class hierarchy might not be the best idea.
I believe I've now made both GoogleJavaFormat and EclipseFormatterStep incremental. The only thing we'd need to do now before looking at performance gains is to make sure CheckFormatTask.formatCheck
has an additional IncrementalTaskInputs
parameter, as I discussed in https://github.com/diffplug/spotless/issues/31#issuecomment-257908435.
However, I feel rather out of my depth with adding in this parameter myself, as I had a go at it a few days ago, but got stumped when I tried to refactor the rest of the code and the tests to take the parameter into account.
@nedtwigg, would you mind if I were to leave this part to you?
Implemented in 06b5e38. A lot of testing left to do still :)
Great, thank you!
The spotless check task currently doesn't have any inputs and outputs declared and is thus never considered up-to-date. This is slowing down user's builds unnecessarily.