genericworkflownodes / GenericKnimeNodes

Base package for GenericKnimeNodes
https://github.com/genericworkflownodes/GenericKnimeNodes

Changes needed for KNIME2Grid #171

Closed chahuistle closed 7 years ago

chahuistle commented 7 years ago

Summary of the Changes

New CommandLineElement Interface and its Implementations

KNIME2Grid is a KNIME extension that will let users export KNIME workflows to other platforms such as Galaxy or gUSE. Native KNIME nodes and Generic KNIME Nodes (GKN) are handled differently, in order to benefit from the fact that the tools wrapped by GKN don't need KNIME to run. This requires extra information about each of the command line parameters. For instance, think of a KNIME workflow in which a tool wrapped by GKN reads an input file and generates an output file. The command line would look like:

tool --input /var/tmp/input0.csv --output /var/tmp/output0.csv --length 5

If this workflow were to run on a different platform, the paths of the input and output files would need to be changed. The newly included package com.genericworkflownodes.knime.commandline contains interfaces and classes that wrap command line elements and can generate their string representation.

For the sake of brevity, let's assume that our tool has a custom command generator that hardcodes all values (i.e., an implementation of ICommandGenerator). Before this refactoring, the method that generates the command line would have looked like:

public List<String> generateCommands(/* params */) {
  List<String> commands = new ArrayList<String>();
  commands.add("--input");
  commands.add("/var/tmp/input0.csv");
  commands.add("--output");
  commands.add("/var/tmp/output0.csv");
  commands.add("--length");
  commands.add("5");
  return commands;
}

The generation of the command line would be, of course, the concatenation of the obtained list:

final StringBuffer commandLine = new StringBuffer();
for (final String command : commands) {
  commandLine.append(command).append(' ');
}

After the refactoring, this method looks similar to:

public List<CommandLineElement> generateCommands(/* params */) {
  List<CommandLineElement> commands = new ArrayList<CommandLineElement>();
  commands.add(new CommandLineFixedString("--input"));
  commands.add(new CommandLineFile("/var/tmp/input0.csv"));
  commands.add(new CommandLineFixedString("--output"));
  commands.add(new CommandLineFile("/var/tmp/output0.csv"));
  commands.add(new CommandLineFixedString("--length"));
  commands.add(new CommandLineParameter(5));
  return commands;
}

The generation of the command line would then look similar to:

StringBuffer commandLine = new StringBuffer();
for (CommandLineElement command : commands) {
  commandLine.append(command.getStringRepresentation()).append(' ');
}

When the list of commands is passed down to the methods that handle the conversion, these methods will know that a CommandLineFixedString stays as it is, but that a CommandLineFile has to be handled differently depending on the target platform. Furthermore, these new classes carry useful information such as a sequence number, in case a platform relies on it.
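As a rough sketch of how a converter might act on the element types, the example below uses simplified stand-ins for CommandLineElement, CommandLineFixedString and CommandLineFile. These are not the actual GKN classes, and the `render` method and the `${fileN}` placeholder syntax are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ElementDispatchSketch {

    // Simplified stand-ins for the GKN types; the real classes carry more information.
    interface CommandLineElement {
        String getStringRepresentation();
    }

    static final class CommandLineFixedString implements CommandLineElement {
        private final String value;
        CommandLineFixedString(String value) { this.value = value; }
        @Override public String getStringRepresentation() { return value; }
    }

    static final class CommandLineFile implements CommandLineElement {
        private final String path;
        CommandLineFile(String path) { this.path = path; }
        @Override public String getStringRepresentation() { return path; }
    }

    // Hypothetical converter: fixed strings are copied verbatim,
    // files are replaced by numbered placeholders for the target platform.
    static String render(List<CommandLineElement> elements) {
        List<String> parts = new ArrayList<>();
        int fileIndex = 0;
        for (CommandLineElement element : elements) {
            if (element instanceof CommandLineFile) {
                parts.add("${file" + fileIndex++ + "}");
            } else {
                parts.add(element.getStringRepresentation());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<CommandLineElement> commands = new ArrayList<>();
        commands.add(new CommandLineFixedString("--input"));
        commands.add(new CommandLineFile("/var/tmp/input0.csv"));
        commands.add(new CommandLineFixedString("--output"));
        commands.add(new CommandLineFile("/var/tmp/output0.csv"));
        System.out.println(render(commands));
    }
}
```

A real converter would probably resolve each placeholder per platform rather than by position, but the type check is the part that the plain List<String> could not support.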

CTD files need special handling: they are generated rather than supplied as true input files, and they contain information about the parameters, including the paths of needed files. So, if we take a look at the OpenMSCommandGenerator before the refactoring, we find:

public List<String> generateCommands(/* params */) {
  File iniFile = createINIFile(...);
  List<String> commands = new ArrayList<String>();
  commands.add("-ini");
  commands.add(iniFile.getCanonicalPath());
  return commands;  
}

After the refactoring, this method looks like:

public List<CommandLineElement> generateCommands(/* params */) {
  File iniFile = createINIFile(...);
  List<CommandLineElement> commands = new ArrayList<CommandLineElement>();
  commands.add(new CommandLineFixedString("-ini"));
  commands.add(new CommandLineCTDFile(iniFile));
  return commands;
}

The code responsible for converting such a node would find that a CTD file is included in the command line and would have enough information to handle it appropriately.

Exposed Methods, Packages

KNIME2Grid needs access to the new classes and also to some other packages in order to inspect in detail the nodes to convert. A summary of the newly exposed information follows:

chahuistle commented 7 years ago

Overview

Type Hierarchy

[Figure: screenshot of the type hierarchy, 2017-05-10]

This figure depicts the hierarchy of the new classes/interfaces. The main interface is CommandLineElement and all other classes/interfaces implement/extend it. A brief explanation of each class/interface follows:

File Handling

Let's take a look at how a simple workflow in KNIME handles files. Suppose a GKN requires an input file and produces an output file. Both files are stored in KNIME's temporary folder (e.g., /tmp). The command line of that node would look like:

$ [tool_name] --input /tmp/input0.csv --output /tmp/output0.xml

If we were to export the workflow in which this GKN has been included, we could no longer assume that the folder /tmp/ is available. Each target system handles this in a different way. Therefore, when a GKN is to be exported, we need to keep track of the input and output files so that KNIME2Grid can modify their paths depending on the destination platform.
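To make the path change concrete, here is a minimal sketch that moves a file from a local root to a target root using java.nio.file. The `rebase` helper and the target folder /scratch/job42 are hypothetical; how KNIME2Grid actually maps paths for each platform is not shown in this issue.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathRebaseSketch {

    // Re-roots a file that lives under localRoot so that it lives under targetRoot instead.
    static Path rebase(Path file, Path localRoot, Path targetRoot) {
        if (!file.startsWith(localRoot)) {
            throw new IllegalArgumentException(file + " is not located under " + localRoot);
        }
        // relativize yields the part of the path below localRoot, e.g. "input0.csv".
        return targetRoot.resolve(localRoot.relativize(file));
    }

    public static void main(String[] args) {
        Path moved = rebase(
                Paths.get("/tmp/input0.csv"),
                Paths.get("/tmp"),
                Paths.get("/scratch/job42")); // hypothetical working folder on the target system
        System.out.println(moved);
    }
}
```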

Handling of CTDs

CTDs are indeed files, but they must be handled in a different way than ordinary files. Imagine a GKN that is fully CTD compatible, such as any OpenMS, SeqAn or BALL tool. Whenever it's executed in KNIME, the command line would be similar to:

$ [tool_name] --ctd /tmp/ctd0.ctd

Treating CTDs as plain files is not a sufficient abstraction, because CTDs contain information about other files. Imagine that said tool requires one input file and one parameter, and produces one output file. A section of /tmp/ctd0.ctd could look like:

<ITEM name="input"    type="input-file"   value="/tmp/input0.csv" />
<ITEM name="output"   type="output-file"  value="/tmp/output0.csv" />
<ITEM name="pH"       type="double"       value="7.2" />

If we were to export a workflow containing this tool, we would also need to modify the paths contained in CTDs, because we cannot assume that local paths will exist on remote execution environments. This is why the class CommandLineCTDFile exists.
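The following sketch illustrates the kind of rewriting this implies, using the JDK's DOM parser. It is not the CommandLineCTDFile implementation: the `rewritePaths` helper, the PARAMETERS wrapper element around the fragment and the target prefix /scratch/job42/ are assumptions for the sake of a self-contained example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CtdPathRewriteSketch {

    // Parses the XML and re-roots the value of every ITEM typed as input-file or output-file.
    static Document rewritePaths(String xml, String localPrefix, String targetPrefix) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("ITEM");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String type = item.getAttribute("type");
                String value = item.getAttribute("value");
                boolean isFile = type.equals("input-file") || type.equals("output-file");
                if (isFile && value.startsWith(localPrefix)) {
                    item.setAttribute("value", targetPrefix + value.substring(localPrefix.length()));
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite CTD paths", e);
        }
    }

    public static void main(String[] args) {
        // PARAMETERS is used here only as a wrapper to make the fragment well-formed.
        String xml = "<PARAMETERS>"
                + "<ITEM name=\"input\" type=\"input-file\" value=\"/tmp/input0.csv\" />"
                + "<ITEM name=\"output\" type=\"output-file\" value=\"/tmp/output0.csv\" />"
                + "<ITEM name=\"pH\" type=\"double\" value=\"7.2\" />"
                + "</PARAMETERS>";
        Document doc = rewritePaths(xml, "/tmp/", "/scratch/job42/");
        NodeList items = doc.getElementsByTagName("ITEM");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            System.out.println(item.getAttribute("name") + " = " + item.getAttribute("value"));
        }
    }
}
```

Only file-typed items are touched, so the pH parameter keeps its value.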

Handling KNIME Workflow Files

It is possible to execute a KNIME workflow using the so-called batch mode. For this, one must provide the location of the archive that contains a valid KNIME workflow to execute, as shown below:

$ [path_to_knime] -workflowFile="/share/wfs/knime/wf_1.zip"

The class CommandLineKNIMEWorkflowFile represents this kind of file. KNIME2Grid generates these files on the fly, so it is possible to execute KNIME workflows on other platforms where KNIME has been installed and no user interface is required.