Task Inputs/Outs - Githubissues

Hey,

Thank you for your work!

We are using the plugin in a new project and had a peculiar case of the build cache bringing back generated types that were long gone from the sources, so I dug into this a bit.

It appears that the plugin declares all the files in the output directory as task outputs, and also doesn't declare its inputs. This (AFAIK) prevents cacheability and incremental builds altogether, which is why I added our WSDL sources as task inputs.

I didn't find time to cleanly reproduce yet, but I think what happens is:

e.g.: WSDL is updated, removing a type
wsdl2java task runs, doesn't produce that type anymore, but afterwards declares the whole folder as its output, and as the java files from the previous generation are still in generated-sources, it will include the old files
(When the build cache is used: The updated WSDL will now point to an output that it couldn't have produced. Which is what happened to us I think)
All the files in the output directory, including our undead type, are added to the source set (which goes straight into classes, the jar and our apps, obviously)

I glanced at the plugin code, and I think a better approach would be to either clean the output directory before calling wsdl2java (which has issues, see below), or generating into a (fresh) temporary directory under "build", and then iterating over the resulting file set and adding them as outputs individually.

On a related note: I think the default you chose for the output directory is a bit risky. "generated-sources" seems pretty common, and if another plugin in the same build uses that, this plugin would declare it as input -- this could lead to confusing/hilarious results 🙂 . In other plugins I saw a namespacing approach applied (build/generated//...).

Additionally, I think every wsdl2java task should have its own output directory, to not pick up outputs from another invocation -- this could mean that all generated types are dependent on all input WSDLs. This would maybe also have "prevented" #1 - or rather made the problem easier to find as the user would have had a duplication issue (as their WSDLs generated referenced shared types multiple times).

That the output dir can be used for multiple tasks prevents the easy solution of just clearing it before calling wsdl2java. This remains true when the plugin defaults were changed, as the user could still configure multiple wsdl2java tasks to the same output directory. Therefore, I think the best approach is the tempdir approach, then generating the output set as a collection of files, and adding only those files to the main source set as input (otherwise old build outputs would still be picked up). This would establish a clear relation between input WSDLs and their results, and make the plugin fully compatible with incremental builds and the build cache.

What do you think?

I realized it's not really possible to add the individual outputs, as you only know after the JavaTask executed, and then gradle won't let you add to outputs. With the input and output snapshotting approach it shouldn't be necessary though, and they even explicitly allow multiple unrelated tasks re-using the same output directory (although I've read somewhere that it is still not recommended). That said, I think the main issues are:

No declared inputs. This leads to the task being always UPTODATE regardless of input file changes, and also leads to a cache key that is just dependent on the input arguments (e.g. wsdl file name, but not contents). I think the input per wsdl2java task is all imported WSDLs and referenced Schemas. I hacked together some Kotlin code with wsdl4j. Did I forget a case where another file could be included besides WSDL import and schema imports/includes? N.B. If any referenced file is not local, it's not trivial to do the input change. The plugin could issue a warning about that, or try to download http{,s} schemas itself, add them somewhere under build/ and report that as input (that way gradle would compute the actual checksums, although a slight race condition remains when wsdl2java downloads the file again for generation). I'd be happy with a warning about unreliable change/detection and forcefully turning off the build cache in that case.

fun getDefinitions(definition: Definition): Set<Definition> {
    val definitions = mutableSetOf(definition)
    for (imports in definition.imports.values) {
        if (imports is Vector<*>) {
            for (import in imports) {
                if (import is Import) {
                    definitions.addAll(getDefinitions(import.definition))
                }
            }
        }
    }
    return definitions
}

fun getReferencedSchemas(schema: Schema): Set<Schema> {
    val schemas = mutableSetOf(schema)
    for (references in schema.imports.values + schema.includes) {
        if (references is Vector<*>) {
            for (reference in references) {
                if (reference is SchemaReference) {
                    schemas.addAll(getReferencedSchemas(reference.referencedSchema))
                }
            }
        }
    }
    return schemas
}

fun main() {
    val wsdlReader = WSDLFactory.newInstance().newWSDLReader()
    val wsdl = wsdlReader.readWSDL(ClassLoader.getSystemResource("schema/conn/EventService.wsdl").file)
    val definitions = getDefinitions(wsdl)
    val schemas = definitions.flatMap {
        it.types.extensibilityElements.filterIsInstance<Schema>()
    }.flatMap(::getReferencedSchemas).toSet()
    println("Definitions:")
    for (d in definitions) {
        println(d.documentBaseURI)
    }
    println("Schemas:")
    for (s in schemas) {
        println(s.documentBaseURI)
    }
}

No output cleaning. The wsdl2java task records previous output files, so by letting wsdl2java generate to an empty tempdir, then copying the files over to the real outputdir while recording the current file set, it's possible to use the StaleOutputCleaner to remove files that disappeared. Something like this:

static class ProcessOutput implements Action<Task> {

    private final Deleter deleter;

    ProcessOutput(Deleter deleter) {
        this.deleter = deleter;
    }

    @Override
    public void execute(Task task) {
        Wsdl2JavaTask wsdl2JavaTask = (Wsdl2JavaTask) task;
        Path source = wsdl2JavaTask.getTemp().get().toPath();
        Set<File> currentOutputs = new HashSet<>();
        try {
            Files.walk(source, Integer.MAX_VALUE).forEach(file -> {
                Path target = Paths.get(wsdl2JavaTask.getOutputs().getFiles().getAsPath(), file.toString().substring(source.toString().length()));
                Files.copy(file, target);
                currentOutputs.add(target.toFile());
            });
        } catch (IOException e) {
            throw new TaskExecutionException(wsdl2JavaTask, e);
        }

        Set<File> outputsToRemove = new HashSet<>();
        for(File previous: wsdl2JavaTask.getOutputs().getPreviousOutputFiles()) {
            if(!currentOutputs.contains(previous)) {
                outputsToRemove.add(previous);
            }
        }

        StaleOutputCleaner.cleanOutputs(deleter, outputsToRemove, task.getOutputs().getFiles().getSingleFile());
    }
}

finally, I would still add a qualifier to the default output path, e.g. build/generated/io.mateo.cxf-codegen/${name of wsdl task}, because I've seen some plugins that forcefully clear their output directories (although they shouldn't w.r.t. to Gradle's guarantees).

I'd be happy to send a PR, if you think it's worth it.

Inputs

There are no explicit inputs declared because the task extends JavaExec which tracks various items as input. One of those being the arguments to the main class which is executed. For example, given the following:

cxfCodegen {
    wsdl2java {
        register("calculator") {
            wsdl.set(file("calculator.wsdl"))
        }
    }
}

Running the task twice will yield first some work done, but then the second invocation would be skipped since nothing changed from the first run. However, changing the arguments that are passed to the wsdl2java main/tool:

cxfCodegen {
    wsdl2java {
        register("calculator") {
            wsdl.set(file("calculator.wsdl"))
+          suppressGeneratedDate.set(true)
        }
    }
}

will yield:

Task ':wsdl2javaCalculator' is not up-to-date because:
  Value of input property 'args' has changed for task ':wsdl2javaCalculator'

So there are inputs declared, just not explicitly. One thing that you pointed out which I agree with is the WSDL file itself being input. As of now, it is just a String argument to the tool, and well, a String is just that: a String.

I don't think the plugin should offer some sort of custom solution for incremental builds since that can open more issues than what it's worth because one solution can't cover all possible scenarios. Another way to look at it is by not having a custom solution means less things to maintain :wink: and I think what Gradle provides is sufficient enough. Since everything is just plain Java, there is nothing stopping a user from implementing their own custom check (e.g., upTpDateWhen) to see if a task is up-to-date.

Outputs

The plugin only configures an output directory here which defaults to generated-sources here. Yes the output directory seems pretty common, but it was a deliberate choice to align with the output directory that the Maven plugin defaults to here which is generated-sources/cxf. I'm not sure how /cxf was left off, but I do agree that each wsdl2java task should have its own output directory.

It's possible to declare the same types in unrelated WSDLs which the -autoNameResolution option helps with. But if each task had its own dedicated output directory, then the clashing would not be a problem. I also agree that having dedicated output directories maybe would have made #1 more clear.

As for cleaning before the task is executed, I think that should be an option, but user opt-in. When a wsdl2java task is created and corresponding clean task is also created which a user can choose to use manually or enable/enforce (dependsOn) always,

Takeaways:

Declare/track the WSDL file as an input to the task. File implements Serializable, so Gradle can fully track the contents of the file for any changes.
Change the output directory to task specific output directory. For example, wsdl2javaSomething-generated-sources-cxf.
Create a clean (Delete) task for each wsdl2java task.

Hey @ciscoo !

Yes, you are of course right, there are inputs. I was talking about inputs besides the contents of the wsdl files (and referenced schemas).

wrt. outputs: I half-implemented a change in the plugin that generates to a temporary directory, diffs its content against getPreviousOutputFiles and removes extra files in the output, but I think the clean task solution is even cleaner (iff the task is supposed to run, i.e. arguments, or contents of the wsdl file changed, just remove all previous output from the target dir and let cxf generate everything directly into that) - maybe it should be an optional task action though, to get access to previousOutputFiles of the JavaExec without having to reach into the internal model of another task. I just wish there was an easy way to see the task inputs/outputs, cache keys and exact contens of the build cache as a graph somehow, I got sidetracked a lot because I don't have much experience with gradle yet.

I'm not as concerned about the shared output directory any more as I understand the input/output snapshotting of gradle more, but it will probably prevent situations of outputs overwriting each other making all tasks never up to date.

Thanks for destructuring this issue into the 3 sub issues, I'm gonna keep commenting there.

ciscoo / cxf-codegen-gradle

Task Inputs/Outs #12

Inputs

Outputs