XAI4Chem package features and examples

miquelduranfrigola commented 3 months ago

Hi @HellenNamulinda and @GemmaTuron

Find below a suggested outline of the features that XAI4Chem must have, and of the different examples we will offer in the MSc project.

Components

The workflow is composed of 3 steps:

Molecular representation: descriptors or fingerprints. The user can choose from multiple fingerprints or descriptors to be used. This module must be extensible to include new types of fingerprints and descriptors in the future.
Supervised ML: tree-based methods such as XGBoost are used to fit a model. No hyperparameter tuning is necessary. Models will be evaluated in a cross-validation framework and, at the end, a full model will be trained on all the data. Explanations at training time will be computed on the held-out test sets of the cross-validation.
Reporting: An automated report is generated in the specified output folder. The report contains multiple plots as well as tables with feature importance, performance scores, etc.
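
To make the evaluate-then-refit logic of the supervised step concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the names kfold_indices, MeanRegressor and cross_validate_then_fit are hypothetical, and MeanRegressor is a trivial placeholder standing in for a tree-based model such as XGBoost.

```python
import random
from statistics import mean


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


class MeanRegressor:
    """Placeholder model (assumption): predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def cross_validate_then_fit(X, y, n_splits=5):
    """Score the model on each held-out fold, then train a final model on all data."""
    fold_scores = []
    for train_idx, test_idx in kfold_indices(len(X), n_splits):
        model = MeanRegressor().fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        preds = model.predict([X[i] for i in test_idx])
        # Mean absolute error on the held-out fold.
        fold_scores.append(mean(abs(p - y[i]) for p, i in zip(preds, test_idx)))
        # Training-time explanations would be computed here, on the held-out fold.
    final_model = MeanRegressor().fit(X, y)
    return fold_scores, final_model
```

The fold scores would feed the performance tables in the report, while only the final model would be stored for inference.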

API and CLI

The package can be run as a CLI or as a Python API.

CLI

We should design a command similar to this: xai4chem train --input_file $INPUT_FILE --output_dir $OUTPUT_DIR --representation morgan_fingerprint

With new samples, we should work as follows: xai4chem infer --input_file $INPUT_FILE --model_dir $MODEL_DIR --output_dir $OUTPUT_DIR

Note that, at inference time, the $MODEL_DIR is the $OUTPUT_DIR obtained at training time.
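
A possible way to wire the two subcommands, sketched with argparse from the standard library. The flag names come from the commands above; the build_parser function, the help texts and the default value for --representation are assumptions, not a settled design.

```python
import argparse


def build_parser():
    """Build a parser with 'train' and 'infer' subcommands (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="xai4chem")
    subparsers = parser.add_subparsers(dest="command", required=True)

    train = subparsers.add_parser("train", help="Fit a model and write a report")
    train.add_argument("--input_file", required=True)
    train.add_argument("--output_dir", required=True)
    # Assumption: default to the representation used in the example command.
    train.add_argument("--representation", default="morgan_fingerprint")

    infer = subparsers.add_parser("infer", help="Predict on new samples")
    infer.add_argument("--input_file", required=True)
    # At inference time, --model_dir is the --output_dir produced by 'train'.
    infer.add_argument("--model_dir", required=True)
    infer.add_argument("--output_dir", required=True)
    return parser


# Parsing an explicit argument list instead of sys.argv, for illustration.
example = build_parser().parse_args(
    ["infer", "--input_file", "new.csv", "--model_dir", "run1", "--output_dir", "preds"]
)
```

Keeping train and infer as subcommands of a single entry point means the model directory written by one can be passed unchanged to the other.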

Python API

With the Python API, we need 3 main modules (representation, supervised and reporting) with classes corresponding to each specific method. Importantly, all classes within a module need to have the same parameters.

Examples of descriptors and fingerprints would be: MorganFingerprint, DatamolDescriptor, etc. Descriptors and fingerprints should work with a fit and transform logic.
For supervised learning, we need Classifier and Regressor classes.
For reporting, we need a Reporter class having a tables and a figures method.
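
As a rough sketch of the fit and transform logic and of the Reporter interface, the snippet below uses only the standard library. The names BaseRepresentation and CharCountDescriptor, the feature_importance.csv file name and all method signatures are assumptions. CharCountDescriptor is a toy stand-in that counts characters in SMILES strings; it is not a real chemical descriptor and only shows the shape a class such as MorganFingerprint could follow. Classifier and Regressor are left out to keep it short.

```python
import csv
import os
from abc import ABC, abstractmethod


class BaseRepresentation(ABC):
    """Shared interface (assumption) so every representation is called the same way."""

    def fit(self, smiles_list):
        # Stateless representations can keep this no-op default.
        return self

    @abstractmethod
    def transform(self, smiles_list):
        """Return one feature vector (list of numbers) per SMILES string."""


class CharCountDescriptor(BaseRepresentation):
    """Toy stand-in, not real chemistry: counts characters seen during fit."""

    def fit(self, smiles_list):
        self.vocabulary_ = sorted({ch for s in smiles_list for ch in s})
        return self

    def transform(self, smiles_list):
        return [[s.count(ch) for ch in self.vocabulary_] for s in smiles_list]


class Reporter:
    """Writes report artifacts into the output folder (hypothetical layout)."""

    def __init__(self, output_dir):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def tables(self, feature_importance):
        """Write a feature-importance table as CSV, most important first."""
        path = os.path.join(self.output_dir, "feature_importance.csv")
        ranked = sorted(feature_importance.items(), key=lambda kv: -kv[1])
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["feature", "importance"])
            writer.writerows(ranked)
        return path

    def figures(self):
        # Placeholder: plotting is left open here; returns the paths written.
        return []
```

Because new representations only need to subclass the base class and implement transform, adding a fingerprint later should not require changes to the supervised or reporting modules.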

Case examples

For regression, we can use one of the TDC datasets (to be discussed).
For classification, we will work with the MMV datasets (primary and/or secondary).

HellenNamulinda commented 3 months ago

This is clear to me. Let me work on it.

HellenNamulinda commented 3 months ago

Hello @miquelduranfrigola, the PR implements the CLI. From today's meeting, the action is to update it and ensure comprehensive reporting at all stages (including interpretability plots at inference time). I'm updating it and will ping you once done.
