UBC-MDS / pymleda

Python package that helps with preliminary eda for supervised machine learning tasks.
MIT License

Outline package details in README.md #7

Closed yaz-saleh closed 3 years ago

yaz-saleh commented 3 years ago
Saule-Atymtayeva commented 3 years ago

Features

The pymleda package will include the following classes/functions:

The pymleda package is intended to help with EDA for supervised machine learning tasks. Existing packages such as scikit-learn and pandas already contain some similar functionality. For example, pandas provides separate functions such as isnull(), isna(), and notna() to detect missing values, and fillna() and interpolate() to fill them. Our pymleda package intends to augment the existing functionality of these packages with some additional features.

The autoimpute_na function will combine the two steps of identifying and imputing missing values in the columns of a dataframe, taking into account the type of each column (numeric or categorical). Moreover, autoimpute_na will detect some common non-standard missing values entered manually by users (e.g., "not available", "n/a", "na", "-"). The output of the autoimpute_na function will be a dataframe with imputed values.

In supervised machine learning, data splitting is often a multi-step process: the dataset of interest is split into train and test portions, and each portion is then split further into X (features) and y (target) subsets. Typically, the user has to create and keep track of separate variables that hold each of these subsets. SupervisedData is a wrapper class that builds upon the output of train_test_split() from sklearn and gives the user quick access to appropriately named attributes referring to each of these subsets. These attributes can then be used in subsequent steps of the machine learning pipeline.
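To make the two proposed features more concrete, here is a rough sketch of how they might behave. This is not the pymleda implementation: the names autoimpute_na_sketch, SupervisedDataSketch, and from_dataframe, the attribute names, the marker list, and the choice of mean imputation for numeric columns and mode imputation for categorical columns are all assumptions made for illustration.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed marker list, taken from the examples given in the description.
NON_STANDARD_NA = ["not available", "n/a", "na", "-"]


def autoimpute_na_sketch(df):
    """Hypothetical sketch: detect missing values, then impute by column type."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: numeric columns are filled with the column mean.
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Turn manually entered markers such as "n/a" into real NaN.
            cleaned = out[col].astype(str).str.strip().str.lower()
            out[col] = out[col].mask(cleaned.isin(NON_STANDARD_NA), np.nan)
            # Assumption: categorical columns are filled with the most
            # frequent value, if the column has any non-missing value.
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out


@dataclass
class SupervisedDataSketch:
    """Hypothetical wrapper holding each subset under a descriptive name."""

    train_df: pd.DataFrame
    test_df: pd.DataFrame
    x_train: pd.DataFrame
    y_train: pd.Series
    x_test: pd.DataFrame
    y_test: pd.Series

    @classmethod
    def from_dataframe(cls, df, target, test_size=0.2, random_state=123):
        # Split rows into train and test portions first.
        train_df, test_df = train_test_split(
            df, test_size=test_size, random_state=random_state
        )
        # Then split each portion into features (X) and target (y).
        return cls(
            train_df=train_df,
            test_df=test_df,
            x_train=train_df.drop(columns=[target]),
            y_train=train_df[target],
            x_test=test_df.drop(columns=[target]),
            y_test=test_df[target],
        )


raw = pd.DataFrame({"age": [20.0, np.nan, 40.0], "city": ["A", "n/a", "A"]})
imputed = autoimpute_na_sketch(raw)

labelled = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
data = SupervisedDataSketch.from_dataframe(labelled, target="label")
```

With this shape, later pipeline steps could refer to data.x_train or data.y_test directly instead of tracking four or more loose variables.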

tanmaysharma19 commented 3 years ago

Addressed in #20