Purpose:

Once this is in place, we will have proof that we can ship processing jobs over the wire. That opens up the possibility of creating a web service that specializes in executing them. It also makes us immune to versioning issues with loaded classes whose definition may differ between the time a mapping script is created and the time it is actually run. As a side effect, I expect all leaking of class definitions into the JDK Metaspace to be solved.

Prerequisites:

All scripts must inline any external dependencies.
Both the current sip-app and narthex must use this new interface.

Tasks

Sandbox scripts. Forbid them to import anything not shipped with the JRE.
Inline MappingCategory.groovy

runMapping API must be refactored to:

/**
 *
 * Transforms xml by applying a (Groovy) script to it.
 *
 * @param record the xml of the input-record to be mapped (transformed)
 * @param scriptCode the groovy script to be applied to the record
 * @param additionalContext additional variables to be bound to the script execution context.
 * @return the resulting XML
 * @throws IllegalArgumentException when the scriptCode cannot be compiled by the underlying engine.
 * @throws IllegalArgumentException when any of the objects in <code>additionalContext</code> are not instances of stock JDK classes.
 */
Optional<String> transform(String record, String scriptCode, Map<String, ?> additionalContext) throws IllegalArgumentException;
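The second `@throws` clause implies a check on the values passed in `additionalContext`. Below is a minimal sketch of what such a check might look like. It assumes that "stock JDK class" can be approximated as "class in a `java.` package"; the class `ContextValidator` and its method names are hypothetical and not part of sip-core.

```java
import java.util.Map;

// Hypothetical helper, not part of sip-core.
final class ContextValidator {

    private ContextValidator() {}

    // Assumption: a "stock JDK class" is approximated as any class whose
    // name starts with "java." (String, Integer, ArrayList, ...).
    // Arrays are rejected by this rule and nested contents are not inspected.
    static boolean isStockJdk(Object value) {
        return value == null || value.getClass().getName().startsWith("java.");
    }

    // Throws IllegalArgumentException for the first offending entry,
    // mirroring the second @throws clause of transform().
    static void requireStockJdk(Map<String, ?> additionalContext) {
        for (Map.Entry<String, ?> entry : additionalContext.entrySet()) {
            if (!isStockJdk(entry.getValue())) {
                throw new IllegalArgumentException(
                    "Context variable '" + entry.getKey() + "' has non-JDK type "
                        + entry.getValue().getClass().getName());
            }
        }
    }
}
```

An implementation of `transform` could call `requireStockJdk` before binding the variables, so that only values that serialize easily ever reach the script.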

Impact

Every mapping will need to be re-generated. Luckily, this happens on the fly.
Users must always match the Sip-version they use with the Narthex version in use. They must share the same sip-core.

In general, I agree with your plan for updating the MappingRunner interface. I have some important points that need to be taken into account when implementing this.

Input for the MappingRunner interface

The mapping processor takes five external sources, plus the source records, to run the mapping:

Mapping file = This file contains all the information provided by the user to map from the source to the target format (as specified by the record definition; for more information on this, see below). The file is written in XML and contains the following information blocks:
- facts = a key-value list of metainformation about the dataset that become variables in the groovy code. So I can refer to the spec in the mapping via string interpolation, e.g. "${spec}".
- functions = in the sip-creator it is possible to create mapping-specific functions that can be reused in multiple mappings. A function receives the GroovyNode via the 'it' variable and can then manipulate it. The functions also store input samples that can be used to validate that the function works correctly.
- node-mappings:
  - attr
    - inputPath = the path in the source data where it will take the value from
    - outputPath = the path in the record-definition where the data needs to be written to.
  - optionally groovy-code = this element contains the groovy code that is used to manipulate the source values. Each line of groovy code has its own <string></string> entry.
hints file = a key-value TXT file that contains meta-information about how the source data should be interpreted.
narthex_facts.txt = a key-value TXT file that contains the meta-information about the dataset that is managed via the Narthex dataset forms, and also some meta-information about the Narthex deployment, such as rdfBaseUrl, orgId, etc.
record-definition = This XML file describes all the components the sip-creator needs to build the target format that can be mapped to. It contains the following main elements:
- short name of the record definition
- version of the record definition
- root of the output XML
- available attributes and attribute groups
- elem that specify the mapping target
  - attributes, validation rules, default values
- functions - default functions that are available in all mappings that use this record definition
- validation rules
- documentation for each elem and attribute
record definition validation XSD = This XSD validates whether each output record adheres to the rules. This validation is run both in the Sip-Creator and in Narthex during processing. Both the Sip-Creator and Narthex have configuration options that can disable validation.
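The XSD validation step described above can be reproduced with the JDK's own `javax.xml.validation` API. The following is a minimal sketch under that assumption, taking both the XSD and the record as strings; the class name `RecordValidator` is hypothetical, and this is not the code the Sip-Creator or Narthex actually use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper: validates output records against an XSD given as a string.
final class RecordValidator {

    private final Schema schema;

    RecordValidator(String xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            this.schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("The XSD could not be parsed", e);
        }
    }

    // Returns true when the record is well-formed and adheres to the schema.
    boolean isValid(String recordXml) {
        Validator validator = schema.newValidator(); // Validator is not thread-safe: one per call
        try {
            validator.validate(new StreamSource(new StringReader(recordXml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the validator only needs two strings, a caller can run it on the mapping output whenever it chooses, or skip it entirely when validation is disabled.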

Additionally, the source data is available in a GZipped file. Narthex constructs this source in a pocket format.
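Reading the GZipped source into a string needs nothing beyond the JDK. The sketch below assumes the content is UTF-8 encoded XML and does not model the pocket format itself; `GzipSource` and its methods are hypothetical names, and `compress` is included only so the example can be exercised without a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for turning GZipped source data into a plain string.
final class GzipSource {

    private GzipSource() {}

    // Assumption: the compressed payload is UTF-8 text.
    static String decompress(byte[] gzipped) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Only here to make the sketch self-contained.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```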

So the new MappingRunner interface should be able to take these input sources and build its internal models from them. I agree that it is better to pass these to the interface as strings rather than as already instantiated classes.

Also for the development of a Commandline-interface (CLI), the new MappingRunner interface should be a good entry point.

Functions

The Groovy code that is executed has access to functions that are defined on three levels:

Inside the mapping XML file. These functions are only available inside that mapping
Inside the record definition. These functions are available to all mappings that use this record-definition.
System functions that are defined in Sip-Core. These functions are not listed in the functions overview of the Sip-Creator and are available to all mappings. Examples of these are 'sanitizeURI()' and 'discard()'. All functions that are defined in the GroovyNode are also available in the mapping.

If we are going to remove the ability to import classes that are on the Sip-Core classpath from functions, we must make these more complex things available as system functions. There is an example where we are using an external library to do conversion between various formats of geospatial encodings. Ideally, system functions should also be part of the list of user defined functions.
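If functions from all three levels have to end up inside one self-contained script, one possible approach is to prepend their Groovy source to the script body. The sketch below is only an illustration of that idea: `FunctionInliner` is a hypothetical class, each map goes from function name to Groovy source text, and rejecting duplicate names is an assumption, since the thread does not state which level should win on a clash.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds one script from function sources plus a body.
final class FunctionInliner {

    private FunctionInliner() {}

    // levels: e.g. system functions, record-definition functions, mapping functions.
    // Each map goes from function name to the Groovy source of that function.
    static String inline(List<Map<String, String>> levels, String scriptBody) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> level : levels) {
            for (Map.Entry<String, String> fn : level.entrySet()) {
                // Assumption: a name defined on two levels is an error.
                if (merged.putIfAbsent(fn.getKey(), fn.getValue()) != null) {
                    throw new IllegalArgumentException("Duplicate function name: " + fn.getKey());
                }
            }
        }
        StringBuilder script = new StringBuilder();
        for (String source : merged.values()) {
            script.append(source).append("\n\n");
        }
        return script.append(scriptBody).toString();
    }
}
```

The result is a single string, so it can be handed to the runner without any class from the Sip-Core classpath being involved.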

General remarks on versioning of mappings and record definitions

Each mapping is linked to a specific record definition with a specific version. All these versions are available on http://schemas.delving.org. In Hub2, the functionality to interact with the schema-repository was much more integrated. In Hub3, each sip-zip contains the right record-definition and validation XSD.

Also, note that the facts in the mapping seem to be prioritized over the facts in the 'narthex_facts.txt' file. There is already an issue in Narthex that deals with the fact that these two sources can get out of sync, see https://github.com/delving/narthex/issues/136.
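The observed precedence can be expressed as a plain map merge. This is a sketch of the rule as described here, not the actual sip-core code; `FactsMerger` is a hypothetical name, and the fact values used to exercise it would be made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: combines the two fact sources into one map.
final class FactsMerger {

    private FactsMerger() {}

    // Facts from the mapping override facts from narthex_facts.txt,
    // matching the precedence that seems to apply today.
    static Map<String, String> merge(Map<String, String> narthexFacts,
                                     Map<String, String> mappingFacts) {
        Map<String, String> merged = new HashMap<>(narthexFacts);
        merged.putAll(mappingFacts); // later entries replace earlier ones
        return merged;
    }
}
```

Making the merge an explicit step in the caller would also make it easy to log the keys on which the two sources disagree.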

Thanks for the reply. Much of what you wrote confirms what I had deduced from reading the code.

I think output validation against an XSD should be outside the scope of the processor. If the calling thread wants to validate, it can do so itself by simply passing the result to a parser. It is a mistake to compound the two in a single component.
Yes indeed, all types of functions you mentioned will have to be inlined. But from the perspective of the processor, it doesn't matter where they come from. It is up to the user to make sure that the function-definitions don't clash with anything else within the script's namespace.
Doing this is a prerequisite for being able to create a nice async web-api on top of it. If our narthex and sip-app work on top of this simple API, we can make that API available over the wire. The reason I wanted simple argument types is because I wanted simple serialization. I would like to avoid Java serialization, which is a maintenance nightmare. I guess that would eliminate the need for writing a CLI.
Correct me if I'm wrong, but in the end, for the processor it doesn't matter where the facts come from. It is up to the calling thread to make sure they select the right 'facts'. Keep in mind that these facts are just a Map<String, String>.
The inlining can be implemented in the current CodeGenerator class, which is in desperate need of better test coverage and, as a result, some refactoring.
Editing groovy-code (and compiling it to see if it makes sense) should also be separated from actually running the code. Right now we have a BulkMappingRunner and an AppMappingRunner, which both implement MappingRunner. What will be tricky is that our new SimpleMappingRunner will have a more modest contract: it won't provide feedback on non-compiling code the way AppMappingRunner does. So for sip-app, we'll have to write a MappingCompilationReporter of some sort.

One question: I don't quite see what the record-definition has to do with the processing step. Can you explain? I can see how it would be involved in generating the mapping-script, but not in executing it.

For your reference, here's a recently generated copy of our [test mapping-script](https://gist.github.com/hanswesterbeek/bb26f79a2b9bed560e2aa957c772b0b5).

As you can see, it imports some stuff. Once you wade through all the functions, you can see that all it really does is use Groovy's MarkupBuilder to create XML and then return an org.w3c.dom.Node. It would not be hard to always make that a string.

Yes, I am aware that the calling thread will probably have to parse that XML again and turn it into an org.w3c.dom.Document, but the simplicity of the API means that you can use it (over the wire) from, say, a Python application such as Nave.
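To give an idea of how small that conversion is on either side, here is a sketch using only JDK classes. `XmlStrings` is a hypothetical helper: `serialize` is what the script side could do to turn its Node into a string, and `parse` is what a Java caller would do to get a Document back.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for converting between XML strings and DOM nodes.
final class XmlStrings {

    private XmlStrings() {}

    // Caller side: turn the returned string back into a DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    // Script side: turn a DOM node into a string without an XML declaration.
    static String serialize(Node node) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }
}
```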

delving / sip-creator

Change the MappingRunner interface to only accept arguments of types that are in the Java SDK #486