EyeofBeholder-NLeSC / rapidminer-demo

The repo to store relevant files and results of demonstrating usage of RapidMiner in the "Eye of the Beholder" project.
Apache License 2.0

Explore RapidMiner: what can/cannot do #1

Open jiqicn opened 2 years ago

jiqicn commented 2 years ago

General

jiqicn commented 2 years ago

Instead of writing an extension in Java, RapidMiner offers a much easier way to run our own code or scripts: the Python Scripting extension. This extension provides operators that run locally stored Python code.

For more information, see https://docs.rapidminer.com/9.10/studio/connect/python/.
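As a rough illustration, a script for the extension's Execute Python operator could look like the sketch below. The operator calls a function named `rm_main` and passes its inputs as pandas DataFrames; the column name `text` and the helper `count_tokens` are hypothetical and only serve the example.

```python
def count_tokens(text):
    """Count whitespace-separated tokens in one piece of text."""
    return len(str(text).split())


def rm_main(data):
    # `data` is the pandas DataFrame that RapidMiner passes to the script.
    # The "text" column is an assumption; replace it with a real attribute name.
    result = data.copy()
    result["token_count"] = result["text"].apply(count_tokens)
    # The returned DataFrame becomes the operator's output example set.
    return result
```

Keeping the actual logic in a plain helper such as `count_tokens` makes it easy to test the script outside RapidMiner before wiring it into a process.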

jiqicn commented 2 years ago

Visualize and understand an annotated dataset

What can be done

Statistics

Files in many formats can be imported easily. Double-clicking an imported dataset opens the "Results" view, which shows some statistics and visualizations.

It's possible to look at the data table in detail, or see statistics of each column (attribute).

Screenshot 2022-05-12 103529

Screenshot 2022-05-12 103553

Visualization

For data visualization, the "Results" view allows interaction with the input dataset. Visualizations can be saved either as a config file that can be reloaded later or as a static picture. However, this requires users to know the basics of the "Results" view in RapidMiner (e.g. selecting the desired attributes).

Screenshot 2022-05-12 094751

It's also possible to generate multiple visualizations automatically within a process by using the reporting extension. The output report can be PDF, HTML, or some other format, but it is always static (see this example).

What cannot be done

Dashboard in the loop

There are ways of creating a dashboard for presenting and interacting with the visualizations (see this link), but this is much more complex. Since it relies on a separate feature of RapidMiner (the dashboard feature on AI Hub), dashboarding is no longer in the same loop as the other steps.

Interactively choose attributes for visualization

When creating a visualization, the relevant features/attributes must be specified beforehand. This means RapidMiner may not let users choose interactively which attributes to visualize.

jiqicn commented 2 years ago

Train new models and choose a good pipeline

What can be done

Pre-steps

The "Text Processing" extension of RapidMiner provides widely used tools as operators, such as tokenization, extraction, filtering, stemming, and transformation.
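To make these pre-steps concrete, here is a small plain-Python sketch of tokenization, filtering, and stemming. It is not how the RapidMiner operators are implemented: the stopword list, the minimum token length, and the suffix-stripping stemmer are simplified assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "were", "is", "are", "to", "in"}
# Suffixes tried in order by the naive stemmer (not a real Porter stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_tokens(tokens, min_length=3):
    """Drop stopwords and tokens shorter than min_length."""
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_length]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Chain the three steps: tokenize, filter, stem."""
    return [naive_stem(t) for t in filter_tokens(tokenize(text))]


print(preprocess("The annotators were labeling paintings"))
```

In RapidMiner each of these functions would correspond to one operator in the process, so the order of the steps can be rearranged by rewiring rather than by editing code.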

Create pipelines

Rather than implementing pipelines in different processes, the recommended way is to keep all the pipelines in the same process. Users can choose which tool or model they want to use. Clicking any operator shows the corresponding "Parameters" view for tuning, and the "Help" view explains the operator and how to use it in detail. The results of the pipeline are presented both numerically and visually.

Screenshot 2022-05-12 133327

image

Compare results of pipelines

It is possible to run multiple pipelines in the same process to see their results and performances separately.

Screenshot 2022-05-12 134510

It is also possible to compare those pipelines on performance and runtime, and to show their differences intuitively with a ROC curve visualization (for binominal classification tasks only).

Screenshot 2022-05-12 134532

image

image
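For readers unfamiliar with what a ROC comparison measures, the sketch below computes ROC points and the area under the curve (AUC) for two pipelines in plain Python. The labels and confidence scores are made-up values, and the code only illustrates the idea; it is not RapidMiner's implementation.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels are 1 (positive) or 0 (negative); scores are confidences for
    the positive class. The threshold is swept from high to low.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up labels and scores for two hypothetical pipelines.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_b = [0.6, 0.4, 0.7, 0.5, 0.8, 0.3]

auc_a = auc(roc_points(labels, scores_a))
auc_b = auc(roc_points(labels, scores_b))
print(f"pipeline A: AUC = {auc_a:.3f}")
print(f"pipeline B: AUC = {auc_b:.3f}")
```

A curve closer to the top-left corner (higher AUC) indicates a pipeline that ranks positive examples above negative ones more reliably, which is why the comparison only applies to two-class tasks.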

What is unknown

Exploring models and results

The "Results" view offers explanations of the models, e.g. the tree graph of a decision tree. There are also operators for explaining the modeling results, but it is unclear whether they meet users' requirements.

Screenshot 2022-05-12 142256

jiqicn commented 2 years ago

Apply good pipeline on new dataset

In RapidMiner, users can easily apply existing processes to different input data by simply linking the new dataset to the process. This requires the datasets to have a similar ontology, and may also require some additional preprocessing.

RapidMiner processes can also be used in other ways and places. RapidMiner provides an open source Python library with which users can call RapidMiner Studio locally from Python and run processes in the repositories they own.

In addition, the PMML extension allows models to be written to PMML standard files. PMML is an XML-based schema for describing and sharing statistical and machine learning models. Note that not all kinds of models are supported; a list of supported models can be found on their website.
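Because PMML is plain XML, an exported model can be inspected outside RapidMiner with standard tools. The sketch below reads a small hand-written PMML document with Python's standard library and lists its data fields and model type. The document content (field names, model name) is a made-up example, not actual output of the PMML extension.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical PMML document used only for illustration.
EXAMPLE_PMML = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="hypothetical example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="text_length" optype="continuous" dataType="double"/>
    <DataField name="label" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="demo_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="text_length"/>
      <MiningField name="label" usageType="target"/>
    </MiningSchema>
    <Node score="positive"><True/></Node>
  </TreeModel>
</PMML>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(xml_text):
    """Return the PMML version, data field names, and (model type, function) pairs."""
    root = ET.fromstring(xml_text.strip())
    fields = [
        el.get("name") for el in root.iter() if local_name(el.tag) == "DataField"
    ]
    # Model elements are direct children of the root and carry a functionName.
    models = [
        (local_name(child.tag), child.get("functionName"))
        for child in root
        if child.get("functionName") is not None
    ]
    return {"version": root.get("version"), "fields": fields, "models": models}


print(summarize_pmml(EXAMPLE_PMML))
```

A summary like this can serve as a quick check that an exported file contains the expected model type before handing it to another PMML-aware tool.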