UBC-MDS / software-review-2024

0 stars 0 forks source link

PyXplor - Group 2 #9

Open rbouwer opened 8 months ago

rbouwer commented 8 months ago

Submitting Author: Name (@rbouwer) All current maintainers: (@arturoboquin, @iris0614, @phchen5) Package Name: PyXplor One-Line Description of Package: A package for simplifying the EDA of different data types! Repository Link: https://github.com/UBC-MDS/PyXplor Version submitted: 2.0.0 Editor: @ttimbers
Reviewer 1: @Marcony1 - Marco Polo Bravo Montiel Reviewer 2: @salva-u Salva Umar
Reviewer 3: @joeywwwu - Joey Wu Reviewer 4: @YimengXia - Yimeng Xia

Archive: TBD JOSS DOI: TBD Version accepted: TBD Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

pyxplor is a comprehensive Python package designed to automate and streamline the Exploratory Data Analysis (EDA) process. Tailored for various data types including numeric, categorical, binary, and time series data, pyxplor aims to enhance data interpretation through a suite of specialized plotting functions. This package seeks to reduce the complexity and time invested in initial data analysis, making it an essential tool for data scientists and analysts at all levels.

Scope

Domain Specific & Community Partnerships

Community Partnerships

If your package is associated with an existing community please check below:

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

While there are several EDA packages in the Python ecosystem, such as pandas-profiling (link) and sweetviz (link), pyxplor differentiates itself by offering specialized functions for different data types. This targeted approach enables more nuanced and relevant insights, particularly for binary and time-series data which are often less catered for in existing tools. pyxplor complements these existing tools by filling these specific gaps, thus enriching the Python EDA toolkit.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication Options

JOSS Checks - [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Confirm each of the following by checking the box.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

salva-u commented 8 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments:

Overall the package is super useful and very well documented. It is easy to follow along and really enjoyed going through the detailed vignette. The package has nice visualisations and the readme is very nice and well-written. There are some points below for your consideration:

  1. The tests below did not pass, maybe a relative path issue be considered in improvements:
    
    ===================================================================================== short test summary info ===================================================================================== 
    FAILED tests/test_plot_binary.py::test_figure - _tkinter.TclError: Can't find a usable init.tcl in the following directories:
    FAILED tests/test_plot_time_series.py::test_output_path_generation - _tkinter.TclError: Can't find a usable tk.tcl in the following directories:
    ================================================================================== 2 failed, 59 passed in 31.84s ================================================================================== 

2. The pyxplor.py can be deleted since it does not contain anything
3. Break the Main Vignette, into smaller ones, could have a quick look, basic usage section. A longer in depth vignette going into the lengthy explorations that are possible with the package (separated) under a separate tab. Albeit the current breakdown is definitely helpful.
4. For the categorical bar plots, they could be ordered by largest to smallest to make it easier to visualise the categories. They are ordered in the facetted plots, only the first plot in the docs for categorical is not.
5. The badges are not all all there, could consider all adding the 2 missing ones ` Continuous integration and test coverage,` and `Python versions supported`
joeywwwu commented 8 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1h

Review Comments

The package is well organized and very convenient. I really enjoy using it to automate and streamline the exploratory data analysis process. The documentation has explained the use of this package well and provides excellent examples for each core function. Some aspects to consider for improvement are:

  1. The pyxplor.py is suggested to be deleted as it contains nothing.
  2. It would be better if there was more context about the dataset, and a more detailed introduction explaining the importance and applications of EDA in data science would help users understand the relevance and application of the examples.
  3. If possible, include interactive elements or widgets in the tutorial for a hands-on experience. These interactive elements can make the learning process more engaging and effective, and help users better understand the capabilities and use of the package.
  4. See badges example, it would be better to include more relevant badges.

Tests are passed in my end, good job!

$ pytest tests/
========================================= test session starts ==========================================
platform darwin -- Python 3.11.7, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/wenweiwu/Desktop/UBC_MDS/block4/DSCI_524/PyXplor
plugins: cov-4.1.0, anyio-4.2.0
collected 61 items                                                                                     

tests/test_plot_binary.py ..................                                                     [ 29%]
tests/test_plot_categorical.py .............                                                     [ 50%]
tests/test_plot_numeric.py ..............                                                        [ 73%]
tests/test_plot_time_series.py ................                                                  [100%]

========================================= 61 passed in 13.66s ==========================================
YimengXia commented 8 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1h


Review Comments

Overall, I think this package is really helpful in reducing the workload in EDA. The example usage is very comprehensive and clear, enjoy reading through, good job!

Below are the feedback:

  1. Adding the missing badges (Continuous integration and test coverage & Python versions supported)
  2. You can delete the 'pyxplor.py' file as it is empty.
  3. It might be beneficial to reconsider the use of colour in the Distribution of Categorical Variables. The current approach assigns colours to bars based on their count ranking, which could potentially confuse users. For example, in your example.ipynb, pickup_borough is displayed in green for Bronx, while dropoff_borough is in red, solely due to count variations for the same variable. Given that each bar already has a clear label, the additional colour coding might not be necessary and could be removed.
  4. You can enhance the EDA experience by offering scatterplots between the target variable (if numerical) and the numerical explanatory variables, catering to users who want to visualize the relationships before creating models for predictions.
  5. Consider EDA for ordinal variables, I think "passengers" is visualized as an ordinal variable instead of categorical variable, as you've maintained the natural order of the number of passengers instead of ranking them as you did with other categorical variables. (see example.ipynb).
Marcony1 commented 8 months ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1.5h


Review Comments

  1. pyexplor.py is empty.
  2. Although there is a clickable button that leads you to the documentation, I would suggest adding a link in the "About" section of the repo so that less experienced users don't struggle as much when trying to find it.
  3. After following the installation instructions, I couldn't run the tests (Heads-up, I am using a Windows computer). I got the following output in the console:
(pyxplor) Marcony1@MSI ~/OneDrive - Fundacion Universidad de las Americas Puebla/Documents/MDS/Block 4/DSCI 524/PyXplor (main)
$ ls
CHANGELOG.md  CONDUCT.md  CONTRIBUTING.md  docs/  img/  LICENSE  poetry.lock  pyproject.toml  README.md  src/  tests/

(pyxplor) Marcony1@MSI ~/OneDrive - Fundacion Universidad de las Americas Puebla/Documents/MDS/Block 4/DSCI 524/PyXplor (main)
$ pytest tests/
bash: pytest: command not found

I had to pip install pytest and then I got this:

pip install pytest

=============================================== short test summary info ===============================================
ERROR tests/test_plot_binary.py
ERROR tests/test_plot_categorical.py
ERROR tests/test_plot_numeric.py
ERROR tests/test_plot_time_series.py
ERROR tests/test_pyxplor.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 5 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 5 errors in 0.41s ==================================================

(pyxplor) Marcony1@MSI ~/OneDrive - Fundacion Universidad de las Americas Puebla/Documents/MDS/Block 4/DSCI 524/PyXplor (main)
$ pytest tests/ --cov
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov
  inifile: None
  rootdir: C:\Users\Marcony1\OneDrive - Fundacion Universidad de las Americas Puebla\Documents\MDS\Block 4\DSCI 524\PyXplor

pip install pytest-cov

---------- coverage: platform win32, python 3.11.7-final-0 -----------
Name                             Stmts   Miss  Cover
----------------------------------------------------
tests\test_plot_binary.py           94     92     2%
tests\test_plot_categorical.py      70     69     1%
tests\test_plot_numeric.py          80     79     1%
tests\test_plot_time_series.py      66     64     3%
tests\test_pyxplor.py                1      0   100%
----------------------------------------------------
TOTAL                              311    304     2%

=============================================== short test summary info ===============================================
ERROR tests/test_plot_binary.py
ERROR tests/test_plot_categorical.py
ERROR tests/test_plot_numeric.py
ERROR tests/test_plot_time_series.py
ERROR tests/test_pyxplor.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 5 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 5 errors in 0.37s ==================================================

That way, I could run the tests, but I got some errors.

  1. I tried to run the code in example.ipynb and I got this:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-xxxxxxxxxxxx> in <module>
----> 1 import pyxplor
      2 from pyxplor.plot_binary import plot_binary
      3 from pyxplor.plot_categorical import plot_categorical

ModuleNotFoundError: No module named 'pyxplor'

So, apparently, pyxplor was still not available in my computer after following the instructions. I ended up using pip install pyxplor. After that, I could run the code without any issues. Also, I re-ran the tests and now all of them passed:

================================================== warnings summary ===================================================
tests\test_plot_binary.py:2
  C:\Users\Marcony1\OneDrive - Fundacion Universidad de las Americas Puebla\Documents\MDS\Block 4\DSCI 524\PyXplor\tests\test_plot_binary.py:2: DeprecationWarning:
  Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
  (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
  but was not found to be installed on your system.
  If this would cause problems for you,
  please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

    import pandas as pd

tests/test_plot_time_series.py::test_valid_input
  C:\Users\Marcony1\miniconda3\envs\pyxplor\Lib\site-packages\pyxplor\plot_time_series.py:104: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
    ts_data = input_df.set_index(date_column).resample(freq)[column].mean()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===========================================
 61 passed, 2 warnings in 21.40s ===========================================
  1. As of today, the installation instructions' title is "Installation (developers)". As a regular user, that made me think that I was looking into the wrong section and I searched for the "Installation (mortals)" section. As I didn't find one, I assumed that you chose this title as the package is still in development. However, I would remove the "(developers)" in the final version or create a regular user section and move the developer's instructions to the ReadTheDocs full documentation website.

In general, I really liked your package and I do find it pretty handy. As a more advanced user, I could definitely make things work. However, I think that if you'ld want rookie programmers to use your package, it would be helpful if you made the instructions more baby-level as well as making sure that it works in different OS's so that they won't have any problem trying to use it.

Good job overall!