Closed yaz-saleh closed 3 years ago
[ ] a summary paragraph that describes the project at a high level
The project aims to build a Python package called pymleda with functions that help with preliminary exploratory data analysis (EDA) on a dataset held in a pandas dataframe. The package contains functions and classes for common data preparation and wrangling tasks: data splitting, exploration, imputation, and scaling. These tasks were chosen because they are routinely performed in supervised machine learning, but they may be useful in other kinds of projects as well.
[ ] a bulleted list of the functions (and datasets if applicable) that will be included in the package (this should be a 1-2 sentence description for each function/dataset)
The pymleda package will include the following classes/functions:
- SupervisedData is a wrapper class that splits a pandas dataframe into train and test sets, and then into X and y subsets, based on lists of user-provided columns.
- dftype() returns the data type of each column in the input dataframe. For non-numeric columns, it also returns the unique values and how many of them there are.
- autoimpute_na() identifies and imputes missing values in the columns of a given pandas dataframe.
- dfscaling() applies standard scaling to the numerical features in a pandas dataframe.
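To make the dftype() and dfscaling() descriptions above concrete, here is a minimal sketch of how they might behave. The signatures, return shapes, and the choice of population standard deviation are assumptions made for illustration, not the package's confirmed API.

```python
import pandas as pd


def dftype(df: pd.DataFrame):
    """Hypothetical sketch: column dtypes plus unique values of non-numeric columns."""
    dtypes = df.dtypes
    non_numeric = df.select_dtypes(exclude="number")
    uniques = {
        col: {
            "unique": non_numeric[col].dropna().unique().tolist(),
            "n_unique": non_numeric[col].nunique(),
        }
        for col in non_numeric.columns
    }
    return dtypes, uniques


def dfscaling(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: standard-scale numeric columns, leave the rest untouched."""
    scaled = df.copy()
    num_cols = scaled.select_dtypes(include="number").columns
    # Assumption: population standard deviation (ddof=0), the convention
    # used by scikit-learn's StandardScaler.
    means = scaled[num_cols].mean()
    stds = scaled[num_cols].std(ddof=0)
    scaled[num_cols] = (scaled[num_cols] - means) / stds
    return scaled


df = pd.DataFrame({"age": [20, 30, 40], "city": ["Paris", "Lima", "Paris"]})
dtypes, uniques = dftype(df)
scaled = dfscaling(df)
```

Here uniques reports two distinct values for the city column, and the scaled age column has mean 0 and unit variance while city is left unchanged.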
[ ] a paragraph describing where your packages fit into the Python ecosystem (are there any other Python packages that have the same/similar functionality? Provide links to any that do. If none exist, then clearly state this as well)
The pymleda package is intended to help with EDA for supervised machine learning tasks. Existing packages such as scikit-learn and pandas already contain some similar functionality.
For example, pandas provides separate functions such as isnull(), isna(), and notna() to detect missing values, and fillna() and interpolate() to fill them. Our pymleda package intends to augment the existing functionality of these packages with some additional features.
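For reference, the two separate pandas steps mentioned above (detecting, then filling) look roughly like this. The dataframe and the fill value are made up for illustration.

```python
import pandas as pd

# Toy dataframe (illustrative only) with one numeric and one categorical column
df = pd.DataFrame({"score": [1.0, None, 3.0], "grade": ["a", None, "a"]})

# Step 1: detect missing values
missing_mask = df.isna()
missing_per_column = missing_mask.sum()

# Step 2: fill them, choosing a strategy per column by hand
filled = df.copy()
filled["score"] = filled["score"].interpolate()  # linear interpolation
filled["grade"] = filled["grade"].fillna("a")    # constant chosen by the user
```

Note that the user has to decide, column by column, which filling strategy applies.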
The autoimpute_na function will combine these two steps, identifying and imputing missing values in the columns of a dataframe, while taking into account the type of each column (numeric or categorical). Moreover, autoimpute_na will detect some common non-standard missing values entered manually by users (e.g., "not available", "n/a", "na", "-"). The output of autoimpute_na will be a dataframe with imputed values.
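A minimal sketch of what autoimpute_na could look like is shown below. The imputation strategies (mean for numeric columns, most frequent value for categorical ones), the marker set, and the conversion of marker-polluted numeric columns are all assumptions made for illustration; the proposal does not specify them.

```python
import pandas as pd

# Assumed marker set, taken from the examples given in the text
NON_STANDARD_NA = {"not available", "n/a", "na", "-"}


def autoimpute_na(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: detect and impute missing values per column type."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Treat manually entered markers as missing (case-insensitive)
            as_text = out[col].astype(str).str.strip().str.lower()
            out[col] = out[col].mask(as_text.isin(NON_STANDARD_NA))
            # If every remaining value parses as a number, treat the column as numeric
            converted = pd.to_numeric(out[col], errors="coerce")
            if converted.notna().sum() == out[col].notna().sum():
                out[col] = converted
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: numeric columns are imputed with the column mean
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Assumption: categorical columns are imputed with the most frequent value
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


df = pd.DataFrame({
    "height": [1.5, None, 2.5],
    "weight": ["60", "n/a", "80"],
    "colour": ["red", "-", "red"],
})
result = autoimpute_na(df)
```

In this toy example the "n/a" and "-" entries are recognised as missing, weight is converted to a numeric column, and every gap is filled according to the column's type.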
In supervised machine learning, data splitting is often a multi-step process: the dataset of interest is split into train and test portions, and each portion is split further into X (features) and y (target) subsets. Typically, the user has to create and keep track of a separate variable for each of these subsets. SupervisedData is a wrapper class that builds upon the output of train_test_split() from sklearn and gives the user quick access to appropriately named attributes that refer to each subset. These attributes can then be used in subsequent steps of the machine learning pipeline.
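The idea can be sketched as follows. The constructor parameters (x_cols, y_cols, test_size, random_state) and the attribute names are hypothetical; only train_test_split() is an existing scikit-learn function.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


class SupervisedData:
    """Hypothetical sketch: hold the train/test and X/y subsets of a dataframe."""

    def __init__(self, df, x_cols, y_cols, test_size=0.2, random_state=123):
        # Split rows first, then select feature and target columns from each part
        self.train_df, self.test_df = train_test_split(
            df, test_size=test_size, random_state=random_state
        )
        self.X_train = self.train_df[x_cols]
        self.y_train = self.train_df[y_cols]
        self.X_test = self.test_df[x_cols]
        self.y_test = self.test_df[y_cols]


# Toy dataframe (illustrative only)
df = pd.DataFrame({
    "a": range(10),
    "b": range(10, 20),
    "target": [0, 1] * 5,
})
data = SupervisedData(df, x_cols=["a", "b"], y_cols=["target"])
```

With this layout, the user refers to data.X_train, data.y_test, and so on, instead of tracking four or more separate variables.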
Addressed in #20