Exploratory data analysis (EDA) is an important step in any data analysis. However, carrying out EDA with the ggplot2 package requires considerable coding effort, and it assumes basic knowledge of the functions and grammar-of-graphics syntax appropriate for visualizing categorical and numerical variables. The RSimplerEDA package addresses this by providing functions tailored to produce categorical, numerical, and correlation plots with a single line of code. The package also lets users customize the plots (theme, title, font, size, etc.) to suit their needs, so they can spend more time analyzing the data set and less time configuring ggplot2 plot settings.
A number of packages in the R ecosystem already provide similar functionality, such as DataExplorer and SmartEDA. However, most of them are not easily customizable. RSimplerEDA is lightweight, focuses on three common EDA plots, and offers flexibility in plot type, color scheme, and plot title.
corr_map:
This function takes in a data frame object and a character vector of
numerical feature names, and plots a correlation map. Users can set
several arguments that control the plot, including the method used to
calculate the correlation, the color scheme, and the plot title.
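
For intuition, the computation behind a correlation map can be sketched in base R. This is only an illustration built on `stats::cor()` and the built-in `iris` data, not the package's actual implementation; the helper name `corr_long` and its arguments are hypothetical.

```r
# Hypothetical helper (not part of RSimplerEDA): compute pairwise correlations
# for the chosen numeric columns and return them in long form, one row per
# feature pair, which is the shape a heatmap layer typically consumes.
corr_long <- function(data, features, method = "pearson") {
  corr_matrix <- stats::cor(data[, features], method = method)
  long <- as.data.frame(as.table(corr_matrix), stringsAsFactors = FALSE)
  names(long) <- c("feature_1", "feature_2", "correlation")
  long
}

features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
pearson <- corr_long(iris, features)                        # default method
spearman <- corr_long(iris, features, method = "spearman")  # rank-based alternative
head(pearson)
```

Switching `method` changes only how each coefficient is computed; the long-form layout stays the same, so the plotting step would not need to change.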
numerical_eda:
This function takes in a data frame object, two
numeric columns, and produces either a scatter or line plot to
visualize the relationship between the two numerical features. Users
can optionally change the default arguments for plot type, color, title,
text size, and color scheme, and toggle log transformation for the x
and y axes.
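
As a point of comparison, a similar plot written directly in ggplot2 might look like the sketch below. It uses the built-in `iris` data and standard ggplot2 calls only; it is not taken from the package source, and the specific sizes and title are illustrative choices.

```r
library(ggplot2)

# Scatter plot of two numeric columns, colored by a grouping column,
# with a log-scaled x axis, a title, and a custom base text size.
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2) +                      # swap for geom_line() to get a line plot
  scale_x_log10() +                           # log-transform the x axis
  labs(title = "Petal length vs sepal length") +
  theme(text = element_text(size = 10))       # text size for the whole plot
p
```

Each customization is a separate layer here, which is the kind of repetition a single-call wrapper is meant to save.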
categorical_eda:
This function takes in a data frame object and
one categorical feature, and produces a histogram that visualizes
the distribution of the feature. Users can instead plot the
density of the feature by setting plot_type. The
function also offers customization of color, plot title, font size,
color scheme, plot size, opacity level, and facet factor.
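
One way a `plot_type` switch with faceting and opacity could be built on ggplot2 is sketched below. The helper `dist_plot` and its arguments are hypothetical and do not reflect how `categorical_eda` is actually implemented; the example uses the built-in `iris` data.

```r
library(ggplot2)

# Hypothetical helper (not part of RSimplerEDA): pick the geom from a
# plot_type string, fill by a grouping column, and facet by that same column.
dist_plot <- function(data, x, group, plot_type = "histogram", alpha = 0.6) {
  geom <- switch(plot_type,
    histogram = geom_histogram(bins = 30, alpha = alpha, position = "identity"),
    density = geom_density(alpha = alpha),
    stop("plot_type must be 'histogram' or 'density'")
  )
  ggplot(data, aes(x = .data[[x]], fill = .data[[group]])) +
    geom +
    facet_wrap(group, ncol = 1)   # one panel per group, stacked in one column
}

hist_plot <- dist_plot(iris, "Sepal.Length", "Species")
dens_plot <- dist_plot(iris, "Sepal.Length", "Species", plot_type = "density")
```

Column names are passed as strings in this sketch to keep the helper simple; the package's own functions accept unquoted column names in some arguments, as the usage examples show.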
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/RSimplerEDA")
These basic examples show how to use the functions:
library(RSimplerEDA)
library(magrittr)
library(palmerpenguins)
penguins_drop_na <- penguins %>% tidyr::drop_na()
corr_map(penguins_drop_na, # make sure the data contains no NA values
         c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"))
numerical_eda(penguins,
              body_mass_g,
              bill_length_mm,
              plot_type = "scatter", # "line" is also supported for a line plot
              color = species,
              title = "Body mass (grams) vs Bill Length (mm)")
categorical_eda(penguins,
                xval = body_mass_g,
                plot_type = "histogram", # "density" is also supported for a density plot
                color = island,
                facet_factor = "island",
                facet_col = 1,
                title = "Distribution of Body Mass of Penguins from Each Island",
                font_size = 8)
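
The `tidyr::drop_na()` step before the `corr_map()` call matters because missing values can make a correlation undefined. The sketch below shows the behavior of base R's `cor()` on toy vectors; whether `corr_map()` relies on `cor()` with default settings is an assumption here, not something the package documentation in this README confirms.

```r
# Toy vectors with one missing value (illustrative data, not from penguins).
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, 10)

with_na <- cor(x, y)                          # NA: the default use = "everything" propagates missing values
keep <- complete.cases(x, y)                  # TRUE where both values are present
dropped <- cor(x[keep], y[keep])              # correlation on the complete pairs only
built_in <- cor(x, y, use = "complete.obs")   # same idea, letting cor() drop incomplete pairs
```

Dropping incomplete rows up front, as the example above does, keeps every pairwise correlation computed on the same set of observations.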
Detailed documentation is available in the vignette.
Contributor Name | GitHub Username |
---|---|
Cheuk (Chuck) Ho | ChuckHo777 |
Deepak Sidhu | deepaksidhu |
Nicholas Wu | nichowu |
We welcome and recognize all contributions. See the Contributing Document for contribution guidelines.