EyeofBeholder-NLeSC / knime-demo

This repository keeps files for demonstrating the usage of KNIME.
Apache License 2.0

Explore KNIME: what it can/cannot do #1

Open jiqicn opened 2 years ago

jiqicn commented 2 years ago

General

jiqicn commented 2 years ago

Visualization and exploration of the input dataset

Data visualization as a dashboard

The built-in visualization nodes are already very useful. To generate a dashboard that contains multiple linked visualizations, it's possible to aggregate the view nodes together as a component and link the desired variables. This produces a dashboard containing all of the included visualizations. In this case, the visualizations are automatically linked by the data attributes shared between the nodes (the relevant options in the "Interactivity" panel have to be checked).

(screenshot: dashboard component with linked visualizations)

It's also possible to customize the layout of the dashboard:

(screenshot: customized dashboard layout)

There are also some very nice nodes for visual data exploration, for instance the official automated visualization component:

(screenshot: automated visualization component)

Script integration for visualization

The "Python View" node helps integrating Python code to generate static visualization of data using Python visualization libraries such as matplotlib and seaborn. Also, a surprisingly nice feature is the Generic JavaScript View, which allows us to create highly customized visualization in JavaScript. It supports some widely-used and well-maintained libraries, such as D3 and Plotly. Something even better is that the Generic JavaScript View can be combined with some other built-in JS nodes to make a dashboard. I'm still looking for a way to allow the general JS view node to subscribe/publish events (this link might be useful).

(screenshots: Python View and Generic JavaScript View examples)
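To give an idea of what such a script looks like, below is a minimal sketch for the Python View node, assuming the legacy convention that the input table is exposed as a pandas DataFrame named `input_table` and the rendered figure is written as PNG bytes to `output_image`; the column names are placeholders, not actual attributes of our dataset.

```python
# Minimal sketch of a script for the (legacy) Python View node.
# Assumption: the node exposes the input table as a pandas DataFrame named
# `input_table` and expects the rendered figure as PNG bytes in `output_image`.
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # render off-screen inside the node
import matplotlib.pyplot as plt
import seaborn as sns

# "party" is a placeholder column name for whatever the input table contains.
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(data=input_table, x="party", ax=ax)
ax.set_title("Number of statements per party")

buffer = BytesIO()
fig.savefig(buffer, format="png", dpi=150)
output_image = buffer.getvalue()  # the node picks this up as the view image
```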

Visualization control

Some of the visualization nodes are interactive, which means users can choose which attribute(s) they want to visualize. But most of the view nodes implemented in JS don't support any such control. In that case, it is possible to add a widget node that is connected to the key flow variable of the views. With the re-execution option checked, the linked views are refreshed after the value of the widget changes.

(screenshot: widget node controlling linked views via a flow variable)
jiqicn commented 2 years ago

Train new models and choose a good pipeline

Pre-processing

The political statement dataset is stored as a CSV file, which can easily be loaded by dragging and dropping. There are nodes for converting statement strings to documents so that the tagging nodes can work on the data.

Different taggers are supported, for instance the POS tagger and the Stanford NE tagger. Documents can also be tagged using a pre-defined dictionary (with the Dictionary Tagger node). Stemming, case conversion, and filtering are all possible with the corresponding nodes.

(screenshot: text pre-processing workflow)
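As a rough, non-KNIME illustration of what this node chain does, here is a sketch of the same steps in Python with NLTK (the example sentence is made up; the workflow itself uses KNIME's Text Processing nodes):

```python
# Rough Python equivalent (with NLTK) of the node chain:
# string -> tokens -> POS tagging -> case conversion -> filtering -> stemming.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

statement = "The minister claimed that unemployment fell by 3 percent last year."

tokens = nltk.word_tokenize(statement)          # string -> tokens
tagged = nltk.pos_tag(tokens)                   # POS tagger
lowered = [t.lower() for t in tokens]           # case converter
stop_words = set(stopwords.words("english"))
filtered = [t for t in lowered if t.isalpha() and t not in stop_words]  # filtering
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in filtered]   # stemming

print(tagged)
print(stemmed)
```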

Transformation

To transform the statement documents into a form that can be used for training machine learning models, a typical approach is to vectorize the documents. KNIME supports different ways of word/document vectorization, e.g. TF-IDF and word2vec.

(screenshot: document vectorization workflow)
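For reference, the following sketch shows what TF-IDF document vectorization boils down to, using scikit-learn's TfidfVectorizer on a few toy statements (not the actual dataset):

```python
# TF-IDF vectorization of a few toy "statements" with scikit-learn,
# illustrating what a TF-IDF document-vector node produces.
from sklearn.feature_extraction.text import TfidfVectorizer

statements = [
    "taxes will rise next year",
    "taxes fell last year",
    "unemployment will fall next year",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(statements)        # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())       # the learned vocabulary
print(X.toarray().round(2))                     # one TF-IDF vector per document
```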

Training machine learning models

Similar to RapidMiner, various machine learning models are directly available as nodes. You can also download nodes or components that implement models not included by default, e.g. XGBoost and deep neural networks.

Typically, a model consists of two parts: the learner and the predictor. The learner node takes the training set as input to train the model, while the predictor node tests the performance of the model on the test set. An example that trains a C4.5 decision tree with cross-validation:

(screenshot: C4.5 training with cross-validation)
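The learner/predictor split maps roughly onto fit/predict in code. Below is a sketch of the same idea in scikit-learn, using DecisionTreeClassifier as a stand-in for a C4.5-style learner (it is not an exact C4.5 implementation) and the iris dataset as a placeholder for the vectorized statements:

```python
# Learner/predictor pattern with cross-validation, sketched in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)               # placeholder for the vectorized statements

model = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
scores = cross_val_score(model, X, y, cv=10)    # 10-fold cross-validation
print(scores.mean(), scores.std())
```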

The performance of the model can be evaluated with a separate node, e.g. Scorer or Numeric Scorer, where metrics such as the confusion matrix, accuracy, and error are summarized and presented together.

(screenshot: scorer output)
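Roughly, the Scorer node computes something like the following, sketched here with scikit-learn metrics on a held-out test set:

```python
# What a scorer node summarizes, sketched with scikit-learn metrics.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # "learner"
y_pred = model.predict(X_test)                                         # "predictor"

print(confusion_matrix(y_test, y_pred))     # confusion matrix
print(accuracy_score(y_test, y_pred))       # accuracy; error = 1 - accuracy
```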
jiqicn commented 2 years ago

Explain training results

Built-in methods in KNIME

KNIME natively supports two methods/algorithms for explaining the prediction results of machine learning models (binary/multinomial): Shapley additive explanations (SHAP) and LIME.

Both methods support creating an explanation for a single data item by assigning scores to the variables that indicate how much each variable supports/opposes the classification. The visualization could look something like this:

(screenshot: SHAP and LIME explanation of a single item)

The left panel shows the information of the item (the preprocessed document and tokens with their TF-IDF values), and the right bar chart shows the results of SHAP and LIME separately. It's kind of weird that the two methods explain the result in completely opposite ways, so that's something that needs to be looked into further.
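For comparison outside KNIME, the same kind of single-instance explanation can be sketched with the Python shap and lime packages; the dataset, model, and parameters below are placeholders, not the political statement setup:

```python
# Single-instance explanations with the Python `shap` and `lime` packages,
# mirroring what the KNIME SHAP / LIME components compute for one data row.
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP: per-feature contributions to the prediction for one instance.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])
print(shap_values)

# LIME: local surrogate model around the same instance.
lime_explainer = LimeTabularExplainer(
    X, feature_names=data.feature_names, class_names=data.target_names, mode="classification"
)
explanation = lime_explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(explanation.as_list())   # (feature, weight) pairs: support/oppose the class
```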