ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
GNU General Public License v3.0


Welcome to DataSafari!

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners. Whether you're exploring data, evaluating statistical assumptions, transforming datasets, or building predictive models, DataSafari provides all the tools you need in one package.

This README gives a brief overview of how to start using DataSafari and which features it offers. For a more complete presentation, visit DataSafari's docs.

Quick Start


To get started with DataSafari, install it using pip:

pip install datasafari

Or, if you prefer using Poetry:

poetry add datasafari


Import DataSafari in your Python script to begin:

import datasafari as ds

For detailed installation options, including installing from source, check our Installation Guide in the docs.

DataSafari at a Glance

DataSafari is organized into several subpackages, each tailored to specific data science tasks.

The logic behind the naming of each subpackage is inspired by the typical data workflow: exploring and understanding your data, transforming and cleaning it, evaluating assumptions and finally making predictions. - George


Explore and understand your data in depth, and more quickly than ever before.

Module Description
explore_df() Explore a DataFrame and gain a bird's-eye view of summary statistics, NAs, data types and more.
explore_num() Explore numerical variables in a DataFrame and gain insights on distribution characteristics, outlier detection using multiple methods (Z-score, IQR, Mahalanobis), normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection.
explore_cat() Explore categorical variables within a DataFrame and gain insights on unique values, counts and percentages, and the entropy of variables to quantify data diversity.
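
To make the terms in the table above concrete, here is a minimal sketch of two of the outlier rules (Z-score and IQR) and of Shannon entropy, written with only the Python standard library. It does not call DataSafari and is not its implementation; the function names, thresholds and sample data are assumptions chosen for illustration.

```python
import math
import statistics
from collections import Counter


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def shannon_entropy(labels):
    """Shannon entropy, in bits, of a sequence of categorical labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample: one value is far from the rest.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(iqr_outliers(scores))
print(zscore_outliers(scores, threshold=2.0))
print(shannon_entropy(["a", "b", "a", "b"]))
```

The two outlier rules can disagree on small samples, because a single extreme value inflates the standard deviation that the Z-score depends on. That is one reason to compare several methods rather than rely on one.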


Clean, encode and enhance your data to prepare it for further analysis.

Module Description
transform_num() Transform numerical variables in a DataFrame through operations like standardization, log-transformation, various scalings, winsorization, and interaction term creation.
transform_cat() Transform categorical variables in a DataFrame through a range of encoding options, and clean them uniformly with methods ranging from basic to advanced machine learning-based approaches.
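
For readers unfamiliar with the numerical operations named above, the sketch below shows standardization, min-max scaling and winsorization in plain Python. It illustrates the techniques only and is not how DataSafari implements them; the function names, the percentile defaults and the index-based percentile rule are assumptions.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]


def min_max_scale(values):
    """Rescale values linearly onto the interval [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def winsorize(values, lower_pct=0.1, upper_pct=0.9):
    """Clamp values to bounds taken from the sorted data at the given fractions."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(lower_pct * last)]
    high = ordered[int(upper_pct * last)]
    return [min(max(v, low), high) for v in values]


# Hypothetical sample with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(winsorize(data))
print(min_max_scale([2, 4, 6]))
print(standardize([2, 4, 6]))
```

Winsorization keeps every row but caps extreme values, which is often preferable to dropping outliers when the sample is small.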


Ensure your data meets the required assumptions for analyses.

Module Description
evaluate_normality() Evaluate normality of numerical data within groups defined by a categorical variable, employing multiple statistical tests, dynamically chosen based on data suitability.
evaluate_variance() Evaluate variance homogeneity across groups defined by a categorical variable within a dataset, using several statistical tests, dynamically chosen based on data suitability.
evaluate_dtype() Evaluate and automatically categorize the data types of DataFrame columns, effectively distinguishing between ambiguous cases based on detailed logical assessments.
evaluate_contingency_table() Evaluate the suitability of statistical tests for a given contingency table by analyzing its characteristics and guiding the selection of appropriate tests.
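
As an illustration of what checking a contingency table's suitability can involve, the sketch below computes expected cell frequencies and applies the common rule of thumb that a chi-square test wants every expected count to be at least 5. The criteria DataSafari actually uses are not described in this README, so the function names, the threshold and the returned labels are all assumptions.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def suggest_test(table, min_expected=5):
    """Suggest a test using a rule of thumb (hypothetical, not DataSafari's logic)."""
    expected = expected_frequencies(table)
    smallest = min(min(row) for row in expected)
    is_2x2 = len(table) == 2 and all(len(row) == 2 for row in table)
    if smallest >= min_expected:
        return "chi-square"
    # Small expected counts: an exact test is the usual fallback for 2x2 tables.
    return "fisher-exact" if is_2x2 else "consider merging categories"


# Hypothetical observed counts (rows: groups, columns: outcomes).
large_table = [[20, 30], [25, 25]]
small_table = [[1, 9], [8, 2]]

print(expected_frequencies(large_table))
print(suggest_test(large_table))
print(suggest_test(small_table))
```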


Streamline model building and hypothesis testing.

Module Description
predict_hypothesis() Conduct the optimal hypothesis test on a DataFrame, tailoring the approach based on the variable types and automating the testing prerequisites and analyses, outputting test results and interpretation.
predict_ml() Streamline the entire process of data preprocessing, model selection, and tuning, delivering optimal model recommendations based on the data provided.
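
When a categorical variable with two levels is compared against a numerical one, a two-sample test such as Welch's t-test is a common choice. The sketch below computes Welch's t statistic and its degrees of freedom with the standard library, to show the kind of calculation that automated hypothesis testing wraps. Which tests predict_hypothesis() actually selects is not stated in this README; the function name and sample values are assumptions.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard error of each mean (sample variance / sample size).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Hypothetical measurements for two groups.
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treatment = [5.6, 5.4, 5.5, 5.7, 5.3]

t_stat, dof = welch_t(control, treatment)
print(round(t_stat, 3), round(dof, 3))
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; a statistics package is the usual tool for that step.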

DataSafari in Action

Hypothesis Testing? One line.

from datasafari.predictor import predict_hypothesis
import pandas as pd
import numpy as np

# Sample DataFrame
df_hypothesis = pd.DataFrame({
    'Group': np.random.choice(['Control', 'Treatment'], size=100),
    'Score': np.random.normal(0, 1, 100)
})

# Perform hypothesis testing
results = predict_hypothesis(df_hypothesis, 'Group', 'Score')

How DataSafari Streamlines Hypothesis Testing:

Machine Learning? You guessed it.

from datasafari.predictor import predict_ml
import pandas as pd
import numpy as np

# Another sample DataFrame for ML
df_ml = pd.DataFrame({
    'Age': np.random.randint(20, 60, size=100),
    'Salary': np.random.normal(50000, 15000, size=100),
    'Experience': np.random.randint(1, 20, size=100)
})

x_cols = ['Age', 'Experience']
y_col = 'Salary'

# Discover the best models for your data
best_models = predict_ml(df_ml, x_cols, y_col)

How DataSafari Simplifies Machine Learning Model Selection:


DataSafari is licensed under the GNU General Public License v3.0. This ensures that all modifications and derivatives of this project remain open-source and freely available under the same terms.


Connect with me on LinkedIn or visit my website.

Thank you very much for taking an interest in DataSafari! 💚 - George