EyeofBeholder-NLeSC / knime-demo

This repository keeps files for demonstrating the usage of KNIME.
Apache License 2.0

Explore KNIME: what it can/cannot do #1

Open jiqicn opened 2 years ago

jiqicn commented 2 years ago

General

jiqicn commented 2 years ago

Visualization and exploration of the input dataset

Data visualization as a dashboard

The built-in visualization nodes are already very useful. To generate a dashboard that contains multiple linked visualizations, it's possible to aggregate the view nodes together as a component and link the desired variables. This produces a dashboard containing all of the included visualizations. In this case, the visualizations are automatically linked by the data attributes shared between the nodes (the relevant options in the "Interactivity" panel have to be checked).

(screenshot: dashboard component with linked visualizations)

It's also possible to customize the layout of the dashboard:

(screenshot: customized dashboard layout)

There are also some very nice nodes for visual data exploration, for instance the official automated visualization component:

(screenshot: automated visualization component)

Script integration for visualization

The "Python View" node helps integrating Python code to generate static visualization of data using Python visualization libraries such as matplotlib and seaborn. Also, a surprisingly nice feature is the Generic JavaScript View, which allows us to create highly customized visualization in JavaScript. It supports some widely-used and well-maintained libraries, such as D3 and Plotly. Something even better is that the Generic JavaScript View can be combined with some other built-in JS nodes to make a dashboard. I'm still looking for a way to allow the general JS view node to subscribe/publish events (this link might be useful).

(screenshots: Python View and Generic JavaScript View examples)
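To give an idea of what such a script looks like, below is a minimal sketch for the Python View node, assuming the legacy convention that the input table is exposed as a pandas DataFrame named `input_table` and the rendered figure is written as PNG bytes to `output_image`; the column names are placeholders, not actual attributes of our dataset.

```python
# Minimal sketch of a script for the (legacy) Python View node.
# Assumption: the node exposes the input table as a pandas DataFrame named
# `input_table` and expects the rendered figure as PNG bytes in `output_image`.
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # render off-screen inside the node
import matplotlib.pyplot as plt
import seaborn as sns

# "party" is a placeholder column name for whatever the input table contains.
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(data=input_table, x="party", ax=ax)
ax.set_title("Number of statements per party")

buffer = BytesIO()
fig.savefig(buffer, format="png", dpi=150)
output_image = buffer.getvalue()  # the node picks this up as the view image
```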

Visualization control

Some of the visualization nodes are interactive, which means users can choose which attribute(s) they want to visualize. But most of the view nodes implemented in JS don't support any such control. In that case, it is possible to add a widget node that is connected to the key flow variable of the views. With the re-execution option checked, the linked views are refreshed after the value of the widget changes.

(screenshot: widget node controlling linked views via a flow variable)
jiqicn commented 2 years ago

Train new models and choose a good pipeline

Pre-processing

The political statement dataset is stored as a CSV file, which can easily be loaded by dragging and dropping. There are nodes for converting statement strings to documents so that the tagging nodes can work on the data.

Different taggers are supported, for instance the POS tagger and the Stanford NE tagger. Documents can also be tagged using a pre-defined dictionary (with the Dictionary Tagger node). Stemming, case conversion, and filtering are all possible with the corresponding nodes.

(screenshot: text pre-processing workflow)
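As a rough, non-KNIME illustration of what this node chain does, here is a sketch of the same steps in Python with NLTK (the example sentence is made up; the workflow itself uses KNIME's Text Processing nodes):

```python
# Rough Python equivalent (with NLTK) of the node chain:
# string -> tokens -> POS tagging -> case conversion -> filtering -> stemming.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

statement = "The minister claimed that unemployment fell by 3 percent last year."

tokens = nltk.word_tokenize(statement)          # string -> tokens
tagged = nltk.pos_tag(tokens)                   # POS tagger
lowered = [t.lower() for t in tokens]           # case converter
stop_words = set(stopwords.words("english"))
filtered = [t for t in lowered if t.isalpha() and t not in stop_words]  # filtering
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in filtered]   # stemming

print(tagged)
print(stemmed)
```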

Transformation

To transform the statement documents into a form that can be used for training machine learning models, a typical approach is to vectorize the documents. KNIME supports different ways of word/document vectorization, e.g. TF-IDF and word2vec.

(screenshot: document vectorization workflow)
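For reference, the following sketch shows what TF-IDF document vectorization boils down to, using scikit-learn's TfidfVectorizer on a few toy statements (not the actual dataset):

```python
# TF-IDF vectorization of a few toy "statements" with scikit-learn,
# illustrating what a TF-IDF document-vector node produces.
from sklearn.feature_extraction.text import TfidfVectorizer

statements = [
    "taxes will rise next year",
    "taxes fell last year",
    "unemployment will fall next year",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(statements)        # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())       # the learned vocabulary
print(X.toarray().round(2))                     # one TF-IDF vector per document
```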

Training machine learning models

Similar to RapidMiner, various machine learning models are directly available as nodes. You can also download nodes or components that implement models not included by default, e.g. XGBoost and deep neural networks.

Typically, a model consists of two parts: the learner and the predictor. The learner node takes the training set as input to train the model, while the predictor node tests the performance of the model on the test set. An example that trains a C4.5 decision tree with cross-validation:

(screenshot: C4.5 training with cross-validation)
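The learner/predictor split maps roughly onto fit/predict in code. Below is a sketch of the same idea in scikit-learn, using DecisionTreeClassifier as a stand-in for a C4.5-style learner (it is not an exact C4.5 implementation) and the iris dataset as a placeholder for the vectorized statements:

```python
# Learner/predictor pattern with cross-validation, sketched in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)               # placeholder for the vectorized statements

model = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
scores = cross_val_score(model, X, y, cv=10)    # 10-fold cross-validation
print(scores.mean(), scores.std())
```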

The performance of the model can be evaluated with a separate node, e.g. Scorer or Numeric Scorer, where metrics such as the confusion matrix, accuracy, and error are summarized and presented together.

(screenshot: scorer output)
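Roughly, the Scorer node computes something like the following, sketched here with scikit-learn metrics on a held-out test set:

```python
# What a scorer node summarizes, sketched with scikit-learn metrics.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # "learner"
y_pred = model.predict(X_test)                                         # "predictor"

print(confusion_matrix(y_test, y_pred))     # confusion matrix
print(accuracy_score(y_test, y_pred))       # accuracy; error = 1 - accuracy
```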
jiqicn commented 2 years ago

Explain training results

Built-in methods in KNIME

KNIME natively supports two methods/algorithms for explaining the prediction results of machine learning models (binary/multinomial): Shapley additive explanations (SHAP) and LIME.

Both methods support creating an explanation for a single data item by assigning scores to the variables that indicate how much each variable supports/opposes the classification. The visualization could look something like this:

(screenshot: SHAP and LIME explanation of a single item)

The left panel shows the information of the item (the preprocessed document and tokens with their TF-IDF values), and the right bar chart shows the results of SHAP and LIME separately. It's kind of weird that the two methods explain the result in completely opposite ways, so that's something that needs to be looked into further.
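For comparison outside KNIME, the same kind of single-instance explanation can be sketched with the Python shap and lime packages; the dataset, model, and parameters below are placeholders, not the political statement setup:

```python
# Single-instance explanations with the Python `shap` and `lime` packages,
# mirroring what the KNIME SHAP / LIME components compute for one data row.
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP: per-feature contributions to the prediction for one instance.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])
print(shap_values)

# LIME: local surrogate model around the same instance.
lime_explainer = LimeTabularExplainer(
    X, feature_names=data.feature_names, class_names=data.target_names, mode="classification"
)
explanation = lime_explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(explanation.as_list())   # (feature, weight) pairs: support/oppose the class
```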