sweber15 commented 4 years ago

Submitting Author: Group 4: Sarah Weber (@sweber15), Cheng (Marvin) Min (@marvinmin), Yi (James) Liu (@v5y8), Gaurav Sinha (@sgauravm)
Repository: https://github.com/UBC-MDS/edar/tree/v1.1.0
Version submitted: v1.1.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Yu Fang (@lori94)
Reviewer 2: Andrea Lee (@andrealee011)
Package: edar
Title: EDA
Authors@R:
    c(person(given = "Cheng",
             family = "Min",
             role = c("aut", "cre", "ctb", "cph"),
             email = "marvin.cmin@gmail.com"),
      person(given = "Yi",
             family = "Liu",
             role = c("aut", "ctb", "cph")),
      person(given = "Gaurav",
             family = "Sinha",
             role = c("aut", "ctb", "cph")),
      person(given = "Sarah",
             family = "Weber",
             role = c("aut", "ctb", "cph")))
Description: Conduct initial EDA for exploring data in a dataframe.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
    testthat (>= 2.1.0),
URL: https://github.com/UBC-MDS/edar
BugReports: https://github.com/UBC-MDS/edar/issues
VignetteBuilder: knitr


This package provides a single function which generates a general exploratory data analysis report. The package simplifies EDA tasks that otherwise require a lot of coding effort, such as describing the data, identifying NA values, and plotting the distributions of the variables, all of which are needed to understand the data well.

The target audience for our package is anyone who wants to understand a data set, in particular data scientists and data analysts. Its scientific application is that users can perform a simplified EDA on a dataframe without extensive coding.

There are other packages that can be used for EDA. One that does something similar is DataExplorer. Our package looks specifically at the EDA needed for data science, focusing on NA values, the distribution of categorical variables, summary statistics for numeric variables, and the correlations in a dataframe.

Estimated hours spent reviewing: 1

Review Comments


lori94 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide


The package includes all the following forms of documentation:

Estimated hours spent reviewing: 2hrs

Review Comments

I got the following error when running R CMD check:

checking if this is a source package ... ERROR
Only source packages can be checked.

goodpractice::gp() also suggests: It is good practice to

✖ fix this R CMD check ERROR: Only source packages can be checked.

as well as the following warnings:

Warning messages:
1: In MYPREPS[[prep]](state, quiet = quiet) : Prep step for test coverage failed.
2: In MYPREPS[[prep]](state, quiet = quiet) : Prep step for cyclomatic complexity failed.
Warning in file(con, "r") : cannot open file 'man': No such file or directory
ERROR computing Rd index failed:cannot open the connection

When I ran devtools::test():

devtools::test()
No tests: no files in /Library/Frameworks/R.framework/Versions/3.6/Resources/library/edar/tests/testthat match '^test.*.[rR]$'

When I ran spelling::spell_check_package(), I got the following spelling suggestion. It may be a minor typo, so I am including it just for your reference.

spelling::spell_check_package()
DESCRIPTION does not contain 'Language' field. Defaulting to 'en-US'.
  WORD   FOUND IN
EDA    title:1 description:1

Some suggestions after playing around with the functions using the flights dataset from the nycflights13 package:

  1. For the describe_num_var() function, the user does not actually need to specify all the numerical columns (especially when the dataset is large); they can pass just the numerical columns they want to explore. This could be clarified in the user guide.

  2. The plots from the describe functions are very helpful, but the x-axis labels on the plot generated by describe_cat_var() overlap when there are many different categorical variables.

  3. The describe_na_values() function worked perfectly on a small data set, where users can see which data is missing. On a relatively large dataset, however, the output was a very long run of 0s and 1s, which makes it hard to find the NAs just by eyeballing. Providing the indices of the missing values might work better.
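To illustrate what I mean in point 3, here is a rough base R sketch of one way to report where the missing values are. Note that na_positions() and the toy data frame are names I made up for this example; they are not part of edar, and this is not a claim about how describe_na_values() works internally.

```r
# Hypothetical helper (not part of edar): list the position of every NA
# in a data frame as one row per missing cell.
na_positions <- function(df) {
  # is.na() on a data frame gives a logical matrix; which(arr.ind = TRUE)
  # turns the TRUE cells into (row, col) index pairs.
  idx <- which(is.na(df), arr.ind = TRUE)
  data.frame(
    row = unname(idx[, "row"]),
    column = names(df)[idx[, "col"]],
    stringsAsFactors = FALSE
  )
}

# Toy data, assumed only for illustration
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
na_positions(toy)
```

On a large dataset this returns only as many rows as there are missing cells, so the output stays short when NAs are rare.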

sgauravm commented 4 years ago

All the following changes can be found in the new release v1.2.0.