UBC-MDS / datexplore

python package for EDA and data cleaning
MIT License
1 stars 0 forks source link

Summary of feedback from peer review #77

Open jcairn02 opened 9 months ago

jcairn02 commented 9 months ago

This post is a copy of each suggestion , I will reply to this with commonalities and summarise the suggestions so we can divide the work accordingly and address issues we think are most important . https://github.com/UBC-MDS/software-review-2024/issues/8

"A more detaied introduction of the package. Now the datexplore has a clear and direct introduction in readme saying that it is a package for exploratory data analysis and data cleaning. It would be even better if there is more information on what kind of EDA and cleaning this one is capable of as the field of EDA and data cleaning is pretty broad. Having a more detailed introduction could help user know if the package is the one they need for their specific tasks. Installation: Maybe it will be more helpful to add one more line under installation saying to clone the repo first before running the installation code. Since there may be users not that experienced with git repo and if they did not clone the repo first, they may have issue install this package. Great test functions within test_datexplore.py! I also noticed that there is a test.py file within tests directory, which contains def test_example(): assert 1 == 1. I'm not sure if it could be put into the test_datexplore.py for clearer structure or if it is not used anymore, it may be clearer to remove it? within the example.ipynb, I saw an error message under "For column name with a space:" , it will be better to view the example without any error message. Other things looks nice. If I have to make one more comment, It may be better to have more examples for users to understand how the package works."

"I had trouble installing your environment with 'conda env create -f environment.yml' from the root of the project. I am not sure where the issue comes from but the installation process got stuck. I solved the issue by just creating an empty environment and then installed poetry in that environment. Building on my point from 1, you do not need the environment file when working with poetry. I think you could even remove the environment.yml file form the directory to reduce the number of files in your directory. When I tried to run your tests 5 failed, 12 passed and I got 35 warnings. This has been mentioned by other reviewers before. You have collected all the tests in one file. Unless the functions are very closely related to each other, I prefer having the tests for each function in individual files to make it easier to find the tests for each function. In your functions, you check for the correct input.You could add some tests that test if the error is returned as expected when the user inputs wrong arguments. You have a file tests/test.py that seems like it's not used. You could remove that file."

"The datexplore package is designed for exploratory data analysis and data cleaning in the early stages. It smartly complements broader libraries like Pandas and Scikit-learn by focusing on simplicity and user efficiency. The purpose of the package is clear, and the need for it is well-articulated. The package is well-structured and user-friendly. The documentation is informative, providing a solid overview of the key functions: clean_names, visualise, and detect_outliers. The examples are practical and effectively demonstrate the package's capabilities. Some suggestions for improvement:

  1. When citing function names and introducing parameters in the documentation, consider using special formatting such as bold or code for clarity. For instance, when describing the clean_name() function, the optional parameter case, and its possible values should be emphasized to stand out from the main text, which regular font doesn't achieve.
  2. The package's introduction could be expanded to provide more details.
  3. In the README, separating the badge from the accompanying text might enhance the visual presentation (just a small tip).
  4. I encountered issues when trying to create the environment using environment.yml, which might be due to my local setup, but it would be helpful if the author provided more detailed instructions. The error encountered was:"

" Below are the comments that I have regarding to the project:

When running "pytest tests/", it shows that there are 4 tests that failed. The ci-cd also reflected this issue. The docstring for the function "visualise()" does not specify the example input dataframe df. It might be more clear to add the exact structure of df into the docstring. It would be great to include badges for other categories (i.e. ci-cd and test coverage) Might want to be more detailed in the outline section. Might want to add the "git clone" and "cd" step for clarity. Other things look good!"

jcairn02 commented 9 months ago
  1. Expand Introduction: Provide a more detailed introduction in the README, explaining specific EDA and cleaning capabilities.
  2. Installation Instructions: Add instructions to clone the repo before installation. Might want to add the "git clone" and "cd" step for clarity
  3. Test File Organization: Merge or remove redundant test files (e.g., test.py in the tests directory).
  4. Correct Errors in Examples: Fix the error in example.ipynb under "For column name with a space."
  5. Additional Examples: Provide more examples to demonstrate how the package works.
  6. Installation Process: Address issues with the conda env create -f environment.yml command and consider removing environment.yml if unnecessary due to Poetry.
  7. Test Structure and Coverage: Organize tests into individual files for each function, fix failing tests, and improve test coverage, including tests for incorrect inputs.
  8. Documentation Formatting: When citing function names and introducing parameters in the documentation, consider using special formatting such as bold or code for clarity.
  9. Detailed Outline: Enhance the detail in the package's outline section.
  10. Additional Badges: Include badges for CI/CD and test coverage in the README. Separate badges from text for visual improvements.
  11. clean_names and detect_outliers function examples aren't complete — example of expected output (or description of that output) needs to be filled in (-1) (Feedback from Milestone1 by TA's)

I think step 6 is critical , step 7 is important and was mentioned by multiple people. Most of the general suggestions are optional, the easy quick fixes I agree with and would like to do. IE 2, 3, 8

killerninja8 commented 9 months ago

Thanks so for summarising, Jordan. It looks neat!

I agree with most of the feedback and can get started on it tonight.

I'm happy to work on

and more stuff as we go along.

killerninja8 commented 9 months ago

Also, do y'all think we need to work on 7? AFAIK tiff's repo has only one test file as well.

jcairn02 commented 9 months ago

I think we don't need more than one test file, but we need to make sure our tests work and are reproduceable. We could also improve test coverage.