HK3-Lab-Team / pytrousse

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines.
Apache License 2.0
data-preprocessing data-wrangling dataframe pandas reproducibility

PyTrousse

WIP ⚠️

PyTrousse collects into one toolbox a set of data wrangling procedures tailored for composing reproducible analytics pipelines. Data transformations include encoding, binning, scaling, string replacement, NaN filling, column type conversion, and data anonymization.

Getting started

To install PyTrousse in a Python virtual environment, clone this repository:

$ git clone https://github.com/HK3-Lab-Team/pytrousse.git

then run the following commands:

$ cd pytrousse
$ pip install .

Main Features

Tracing the path from raw data

PyTrousse stores each transformation internally alongside the data as it is applied, linking all stages of data preprocessing for future reproducibility.

Along with the processed data, every Dataset object documents how the user performed the analysis, so that it can be reproduced later and so that questions about how the analysis was carried out can be answered months or years after the fact.

The traced data path can be inspected through the operations_history attribute.

>>> dataset.operations_history
[FillNA(
    columns=["column_with_nan"],
    value=0,
    derived_columns=["column_filled"],
), ReplaceSubstrings(
    columns=["column_invalid_values"],
    replacement_map={",": ".", "°": ""},
    derived_columns=["column_valid_values"],
)]
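
To make the idea concrete, here is a minimal, self-contained sketch of how data and its history can travel together. It is not PyTrousse's implementation: TrackedTable and FillMissing are hypothetical names invented for this illustration, and a plain dictionary of lists stands in for a DataFrame.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Stand-in for a DataFrame: column name -> list of values.
Table = Dict[str, List[Any]]


@dataclass(frozen=True)
class FillMissing:
    """Hypothetical operation: copy `column` into `derived_column`, replacing None."""

    column: str
    value: Any
    derived_column: str

    def apply(self, table: Table) -> Table:
        filled = [self.value if v is None else v for v in table[self.column]]
        return {**table, self.derived_column: filled}


@dataclass
class TrackedTable:
    """Hypothetical container that keeps the data together with its history."""

    data: Table
    history: List[Any] = field(default_factory=list)

    def apply(self, operation: Any) -> "TrackedTable":
        # Return a new object so that earlier stages stay untouched.
        return TrackedTable(operation.apply(self.data), self.history + [operation])


raw = TrackedTable({"column_with_nan": [1, None, 3]})
processed = raw.apply(FillMissing("column_with_nan", 0, "column_filled"))
```

Because each operation is a small immutable object, the history can be printed, compared, or replayed on new raw data.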

Automatic column data type detection

Wouldn't it be cool to have full column data type detection for your data?

PyTrousse extends the pandas tools for data type inference. Automatic identification covers a larger set of types (categorical, numerical, boolean, mixed, string, etc.) and relies on heuristic algorithms.

>>> import trousse
>>> dataset = trousse.read_csv("path/to/csv")

>>> dataset.str_categorical_columns
{"dog_breed", "fur_color"}

You can also get the names of boolean columns, numerical columns (i.e. containing integer and float values) or constant columns.

>>> dataset.bool_columns
{"is_vaccinated"}
>>> dataset.numerical_columns
{"weight", "age"}
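
As a rough illustration of what a heuristic of this kind can look like, the sketch below classifies a column from its values using only the standard library. The function name infer_column_kind, the max_categories threshold and the returned labels are assumptions made for this example; the rules PyTrousse actually applies may differ.

```python
from typing import Any, Iterable


def infer_column_kind(values: Iterable[Any], max_categories: int = 10) -> str:
    """Classify a column with simple, illustrative heuristics."""
    present = [v for v in values if v is not None]  # ignore missing values
    if not present:
        return "empty"
    # bool is a subclass of int in Python, so test for it first.
    if all(isinstance(v, bool) for v in present):
        return "bool"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numerical"
    if all(isinstance(v, str) for v in present):
        distinct = set(present)
        # A few distinct values that repeat suggest a categorical column.
        if len(distinct) <= max_categories and len(distinct) < len(present):
            return "str_categorical"
        return "string"
    return "mixed"


columns = {
    "dog_breed": ["beagle", "pug", "beagle", "pug"],
    "is_vaccinated": [True, False, True, None],
    "weight": [7.5, 8, 9.1, 10],
    "notes": ["ok", 3, None, "n/a"],
}
kinds = {name: infer_column_kind(vals) for name, vals in columns.items()}
```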

Composable data transformations

What about having an easy API for all those boring data preprocessing steps?

Along with the common preprocessing utilities (for encoding, binning, scaling, etc.), PyTrousse provides tools for noisy data handling and for data anonymization.

>>> from trousse.feature_operations import Compose, FillNA, ReplaceSubstrings

>>> fillna_replacestrings = Compose(
...     [
...         FillNA(
...             columns=["column_with_nan"],
...             value=0,
...             derived_columns=["column_filled"],
...         ),
...         ReplaceSubstrings(
...             columns=["column_invalid_values"],
...             replacement_map={",": ".", "°": ""},
...             derived_columns=["column_valid_values"],
...         ),
...     ]
... )

>>> dataset = fillna_replacestrings(dataset)
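
For readers curious about the pattern behind Compose, the following is a minimal sketch of function composition over a table, written without PyTrousse or pandas. The helpers compose, fill_none and replace_substrings are hypothetical stand-ins that mirror the example above; they are not the library's API.

```python
from functools import reduce
from typing import Any, Callable, Dict, List, Sequence

# Stand-in for a DataFrame: column name -> list of values.
Table = Dict[str, List[Any]]
Operation = Callable[[Table], Table]


def compose(operations: Sequence[Operation]) -> Operation:
    """Chain operations so that each one receives the previous one's output."""

    def run(table: Table) -> Table:
        return reduce(lambda current, op: op(current), operations, table)

    return run


def fill_none(column: str, value: Any, derived: str) -> Operation:
    def op(table: Table) -> Table:
        filled = [value if v is None else v for v in table[column]]
        return {**table, derived: filled}

    return op


def replace_substrings(column: str, replacement_map: Dict[str, str], derived: str) -> Operation:
    def op(table: Table) -> Table:
        cleaned = []
        for text in table[column]:
            for old, new in replacement_map.items():
                text = text.replace(old, new)
            cleaned.append(text)
        return {**table, derived: cleaned}

    return op


pipeline = compose(
    [
        fill_none("column_with_nan", 0, "column_filled"),
        replace_substrings("column_invalid_values", {",": ".", "°": ""}, "column_valid_values"),
    ]
)
result = pipeline({"column_with_nan": [1, None], "column_invalid_values": ["36,6°", "37.1"]})
```

Each step writes to a derived column and leaves its input column in place, so the raw values remain available after the pipeline has run.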

Integrated tools for synthetic data generation

PyTrousse aids automated testing by inverting the data transformation operators. Generation of test fixtures and injection of errors are available automatically.
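
One way to picture "inverting" an operator is sketched below: a replacement map is reversed to turn clean values into dirty ones, and the forward transformation is then expected to restore them. This is an assumption about the general technique, not a description of PyTrousse's generators; invert_replacement_map, inject_errors and apply_replacements are hypothetical names.

```python
import random
from typing import Dict, List


def invert_replacement_map(replacement_map: Dict[str, str]) -> Dict[str, str]:
    """Swap keys and values, skipping deletions (empty replacements cannot be undone)."""
    return {new: old for old, new in replacement_map.items() if new}


def inject_errors(clean_values: List[str], replacement_map: Dict[str, str], rng: random.Random) -> List[str]:
    """Randomly reintroduce the substrings that the forward map would remove."""
    inverse = invert_replacement_map(replacement_map)
    dirty = []
    for text in clean_values:
        for valid, invalid in inverse.items():
            if rng.random() < 0.5:
                text = text.replace(valid, invalid)
        dirty.append(text)
    return dirty


def apply_replacements(values: List[str], replacement_map: Dict[str, str]) -> List[str]:
    """Forward transformation: apply every replacement to every value."""
    cleaned = []
    for text in values:
        for old, new in replacement_map.items():
            text = text.replace(old, new)
        cleaned.append(text)
    return cleaned


replacement_map = {",": ".", "°": ""}
clean = ["36.6", "37.1", "38.0"]
dirty = inject_errors(clean, replacement_map, random.Random(0))  # seeded for repeatable fixtures
restored = apply_replacements(dirty, replacement_map)
```

The round trip holds here because the clean values contain none of the invalid substrings; a map in which two keys share the same replacement would need extra care.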