ciscoo / cxf-codegen-gradle

Gradle plugin to generate Java artifacts from WSDL
Apache License 2.0
26 stars 6 forks source link

Task Inputs/Outs #12

Closed pschyska closed 3 years ago

pschyska commented 3 years ago

Hey,

Thank you for your work!

We are using the plugin in a new project and had a peculiar case of the build cache bringing back generated types that were long gone from the sources, so I dug into this a bit.

It appears that the plugin declares all the files in the output directory as task outputs, and also doesn't declare its inputs. This (AFAIK) prevents cacheability and incremental builds altogether, which is why I added our WSDL sources as task inputs.

I didn't find time to cleanly reproduce yet, but I think what happens is:

I glanced at the plugin code, and I think a better approach would be to either clean the output directory before calling wsdl2java (which has issues, see below), or generating into a (fresh) temporary directory under "build", and then iterating over the resulting file set and adding them as outputs individually.

On a related note: I think the default you chose for the output directory is a bit risky. "generated-sources" seems pretty common, and if another plugin in the same build uses that, this plugin would declare it as input -- this could lead to confusing/hilarious results 🙂 . In other plugins I saw a namespacing approach applied (build/generated//...).

Additionally, I think every wsdl2java task should have its own output directory, to not pick up outputs from another invocation -- this could mean that all generated types are dependent on all input WSDLs. This would maybe also have "prevented" #1 - or rather made the problem easier to find as the user would have had a duplication issue (as their WSDLs generated referenced shared types multiple times).

That the output dir can be used for multiple tasks prevents the easy solution of just clearing it before calling wsdl2java. This remains true when the plugin defaults were changed, as the user could still configure multiple wsdl2java tasks to the same output directory. Therefore, I think the best approach is the tempdir approach, then generating the output set as a collection of files, and adding only those files to the main source set as input (otherwise old build outputs would still be picked up). This would establish a clear relation between input WSDLs and their results, and make the plugin fully compatible with incremental builds and the build cache.

What do you think?

pschyska commented 3 years ago

I realized it's not really possible to add the individual outputs, as you only know after the JavaTask executed, and then gradle won't let you add to outputs. With the input and output snapshotting approach it shouldn't be necessary though, and they even explicitly allow multiple unrelated tasks re-using the same output directory (although I've read somewhere that it is still not recommended). That said, I think the main issues are:

fun getDefinitions(definition: Definition): Set<Definition> {
    val definitions = mutableSetOf(definition)
    for (imports in definition.imports.values) {
        if (imports is Vector<*>) {
            for (import in imports) {
                if (import is Import) {
                    definitions.addAll(getDefinitions(import.definition))
                }
            }
        }
    }
    return definitions
}

fun getReferencedSchemas(schema: Schema): Set<Schema> {
    val schemas = mutableSetOf(schema)
    for (references in schema.imports.values + schema.includes) {
        if (references is Vector<*>) {
            for (reference in references) {
                if (reference is SchemaReference) {
                    schemas.addAll(getReferencedSchemas(reference.referencedSchema))
                }
            }
        }
    }
    return schemas
}

fun main() {
    val wsdlReader = WSDLFactory.newInstance().newWSDLReader()
    val wsdl = wsdlReader.readWSDL(ClassLoader.getSystemResource("schema/conn/EventService.wsdl").file)
    val definitions = getDefinitions(wsdl)
    val schemas = definitions.flatMap {
        it.types.extensibilityElements.filterIsInstance<Schema>()
    }.flatMap(::getReferencedSchemas).toSet()
    println("Definitions:")
    for (d in definitions) {
        println(d.documentBaseURI)
    }
    println("Schemas:")
    for (s in schemas) {
        println(s.documentBaseURI)
    }
}

I'd be happy to send a PR, if you think it's worth it.

ciscoo commented 3 years ago

Inputs

There are no explicit inputs declared because the task extends JavaExec which tracks various items as input. One of those being the arguments to the main class which is executed. For example, given the following:

cxfCodegen {
    wsdl2java {
        register("calculator") {
            wsdl.set(file("calculator.wsdl"))
        }
    }
}

Running the task twice will yield first some work done, but then the second invocation would be skipped since nothing changed from the first run. However, changing the arguments that are passed to the wsdl2java main/tool:

cxfCodegen {
    wsdl2java {
        register("calculator") {
            wsdl.set(file("calculator.wsdl"))
+          suppressGeneratedDate.set(true)
        }
    }
}

will yield:

Task ':wsdl2javaCalculator' is not up-to-date because:
  Value of input property 'args' has changed for task ':wsdl2javaCalculator'

So there are inputs declared, just not explicitly. One thing that you pointed out which I agree with is the WSDL file itself being input. As of now, it is just a String argument to the tool, and well, a String is just that: a String.

I don't think the plugin should offer some sort of custom solution for incremental builds since that can open more issues than what it's worth because one solution can't cover all possible scenarios. Another way to look at it is by not having a custom solution means less things to maintain :wink: and I think what Gradle provides is sufficient enough. Since everything is just plain Java, there is nothing stopping a user from implementing their own custom check (e.g., upTpDateWhen) to see if a task is up-to-date.

Outputs

The plugin only configures an output directory here which defaults to generated-sources here. Yes the output directory seems pretty common, but it was a deliberate choice to align with the output directory that the Maven plugin defaults to here which is generated-sources/cxf. I'm not sure how /cxf was left off, but I do agree that each wsdl2java task should have its own output directory.

It's possible to declare the same types in unrelated WSDLs which the -autoNameResolution option helps with. But if each task had its own dedicated output directory, then the clashing would not be a problem. I also agree that having dedicated output directories maybe would have made #1 more clear.

As for cleaning before the task is executed, I think that should be an option, but user opt-in. When a wsdl2java task is created and corresponding clean task is also created which a user can choose to use manually or enable/enforce (dependsOn) always,


Takeaways:

  1. Declare/track the WSDL file as an input to the task. File implements Serializable, so Gradle can fully track the contents of the file for any changes.
  2. Change the output directory to task specific output directory. For example, wsdl2javaSomething-generated-sources-cxf.
  3. Create a clean (Delete) task for each wsdl2java task.
pschyska commented 3 years ago

Hey @ciscoo !

Yes, you are of course right, there are inputs. I was talking about inputs besides the contents of the wsdl files (and referenced schemas).

wrt. outputs: I half-implemented a change in the plugin that generates to a temporary directory, diffs its content against getPreviousOutputFiles and removes extra files in the output, but I think the clean task solution is even cleaner (iff the task is supposed to run, i.e. arguments, or contents of the wsdl file changed, just remove all previous output from the target dir and let cxf generate everything directly into that) - maybe it should be an optional task action though, to get access to previousOutputFiles of the JavaExec without having to reach into the internal model of another task. I just wish there was an easy way to see the task inputs/outputs, cache keys and exact contens of the build cache as a graph somehow, I got sidetracked a lot because I don't have much experience with gradle yet.

I'm not as concerned about the shared output directory any more as I understand the input/output snapshotting of gradle more, but it will probably prevent situations of outputs overwriting each other making all tasks never up to date.

Thanks for destructuring this issue into the 3 sub issues, I'm gonna keep commenting there.