Submission: kmeaningfulR(R)

Submitting Author:

Yihong (Hazel) Jiang (@HazelJJJ)
Mike Lynch (@mikelynch416)
Trevor Kinsey (@trevorki)
Sasha Babicki (@sbabicki)

Repository: kmeaningfulR
Version submitted: 0.2.0
Editor: Tiffany Timbers (@ttimbers)
Reviewers: TBD

Archive: TBD
Version accepted: TBD

Paste the full DESCRIPTION file inside a code block below:

Package: kmeaningfulR
Title: Performs Kmeans Clustering
Version: 0.0.0.9000
Authors@R: 
    c(person(given = "Trevor",
             family = "Kinsey",
             role = c("aut", "cre"),
             email = "tkinsey@student.ubc.ca"),
      person(given = "Sasha",
             family = "Babicki",
             role = c("aut"),
             email = "sbabicki@student.ubc.ca"),
      person(given = "Mike",
             family = "Lynch",
             role = c("aut"),
             email = "mlynch@student.ubc.ca"),
      person(given = "Hazel",
             family = "Jiang",
             role = c("aut"),
             email = "yihongj@student.ubc.ca"))
Description: Uses the K-means algorithm to group similar data into clusters. Includes functions to preprocess and visualize the data.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Suggests: 
    testthat,
    covr,
    knitr,
    rmarkdown
Imports: 
    cluster,
    ggplot2,
    FactoMineR,
    forcats,
    dplyr,
    rlang
URL: https://github.com/UBC-MDS/kmeaningfulR
BugReports: https://github.com/UBC-MDS/kmeaningfulR/issues
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

Kmeaningful falls into the data munging category because it takes raw data, preprocesses it, and helps find clusters and their centroids.
Who is the target audience and what are scientific applications of this package?

The target audience is people who want to perform simple unsupervised learning on a dataset using the k-means method.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

R functions and packages such as stats::kmeans and ClusterR do similar things. We are not trying to break new ground with kmeaningful, but rather to build a simple and lightweight implementation from scratch.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 4

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Hello team:

Please see my review below. I really like the whole package and it is a heavy undertaking. You have put in a lot of effort!

Package API

Function names

I think that, in general, the names and use cases of your functions are reasonable and user-friendly. However, function names like fit and assign are a little too general and too common in the R ecosystem; see here and here. I suggest that you change them to fit_kmeans and assign_kmeans.
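To make the naming concern concrete, here is a minimal sketch of the clash with base R's assign(). The stand-in function and the body of assign_kmeans are hypothetical and written only for illustration; they are not the package's actual code.

```r
# Hypothetical stand-in for a package function called `assign`.
# Defining it here has a shadowing effect similar to attaching a package
# that exports a function with this name.
assign <- function(X, centroids) "k-means assignment"

# Unqualified calls now reach the stand-in instead of base R's assign()
masked <- !identical(assign, base::assign)

# The base function is still available, but only through an explicit prefix
base::assign("answer", 42)

# Remove the stand-in so that base::assign is visible again
rm(assign)

# A more specific (hypothetical) name avoids the clash altogether.
# Assumed behaviour: label each row of X with the index of its nearest centroid.
assign_kmeans <- function(X, centroids) {
  # Squared Euclidean distance from every row of X to every centroid
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(X, 2, centroids[k, ])^2)
  })
  d <- matrix(d, nrow = nrow(X))
  # Index of the smallest distance in each row
  max.col(-d, ties.method = "first")
}

X <- rbind(c(0, 0), c(10, 10))
centroids <- rbind(c(1, 1), c(9, 9))
labels <- assign_kmeans(X, centroids)
```

With a prefixed name, users can attach the package without having to think about which assign they are calling.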

The function name show_clusters is a bit odd to me. If you show something, it is not necessarily a visualization (you can show things with a table). Therefore, I think the name could be changed to vis_clusters.

README

There is no need to comment out the part where you generate the plot. Your GitHub action should be able to knit README.md from README.rmd automatically.

I have also modified some formatting details in the README, as well as a case of inconsistency in the documentation of show_clusters function.

Code

`preprocess` function

The function handles edge cases really well. However, I think it would be better to give users the option to only center or only rescale the data.

In addition, the function seems to silently impute NA values as 0, which is a little unexpected. I hope this can be documented more prominently or flagged with a warning.
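As a rough illustration of both suggestions, the sketch below exposes centering and scaling as separate switches and warns when missing values are replaced. The function name preprocess_sketch and its arguments are assumptions made for this example; the package's real preprocess function may be structured differently.

```r
# Hypothetical sketch, not the package's actual preprocess()
preprocess_sketch <- function(X, center = TRUE, scale = TRUE, impute = 0) {
  X <- as.matrix(X)
  n_missing <- sum(is.na(X))
  if (n_missing > 0) {
    # Tell the user that imputation happened instead of doing it silently
    warning(
      sprintf("Replacing %d missing value(s) with %s", n_missing, format(impute)),
      call. = FALSE
    )
    X[is.na(X)] <- impute
  }
  # base::scale() lets centering and scaling be toggled independently.
  # Note: with center = FALSE and scale = TRUE it divides by the
  # root mean square of each column, not the standard deviation.
  scale(X, center = center, scale = scale)
}

X <- cbind(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))

# Capture the warning text to show what the user would see
msg <- tryCatch(preprocess_sketch(X), warning = function(w) conditionMessage(w))

# Center only; the warning is suppressed here because it was shown above
centered_only <- suppressWarnings(preprocess_sketch(X, scale = FALSE))
```

Letting the user choose the imputation value (or an error instead of imputation) would be a further option, but the warning alone already removes the surprise.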

`fit_assign` function

This series of functions is really well written, and a lot of effort is visible.

`show_clusters` function

The forcats import seems unnecessary: I removed the line locally and everything still works. Please double-check this.

Also, the plot title is not very informative. I suggest that you indicate the use of PCA in the visualization. The color legend does not seem necessary, since we do not need to know the IDs of each cluster anyway.
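The kind of labelling I have in mind could look like the sketch below. It is written with base R only (stats::prcomp, stats::kmeans and base graphics) so that it runs without extra packages; these are stand-ins for whatever the package uses internally, and the iris data, seed and title wording are assumptions for illustration.

```r
# Illustrative data and clustering; stats::kmeans stands in for the package's own fit
X <- as.matrix(iris[, 1:4])
set.seed(2021)
clusters <- stats::kmeans(X, centers = 3)$cluster

# Project the data onto principal components
pca <- stats::prcomp(X, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Axis labels that say what the axes are and how much variance they carry
x_label <- sprintf("PC1 (%.0f%% of variance)", 100 * var_explained[1])
y_label <- sprintf("PC2 (%.0f%% of variance)", 100 * var_explained[2])

# The title states that PCA was used; no legend is drawn because the
# cluster IDs themselves carry no meaning
plot(
  pca$x[, 1], pca$x[, 2],
  col = clusters, pch = 19,
  main = "K-means clusters projected onto the first two principal components",
  xlab = x_label, ylab = y_label
)
```

The same idea carries over to ggplot2 by setting the title and axis labels explicitly and hiding the legend.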

Argument names

I find some of the argument names inconsistent across the package. For example, the same thing is called clusters in show_clusters but labels in avg_sil_score. The use of centroids in show_clusters and centers in fit_assign should also be more consistent.

Documentation and website

In general, the function documentation is very well written.

Formatting issues

However, the generated documentation for some functions, such as show_clusters, avg_sil_score and find_elbow, is not very well formatted. See here, here and here.

Organization of website

I think it is a bit difficult to find detailed documentation of functions on your website. They are now under the "Reference" tab, which is a bit confusing.

Testing

The tests are very well written, especially test_fit_design.R.

The line coverage is about 90%, which could be improved further.

Dependencies

List of dependencies on README

The list of dependencies is not completely clear to potential users. For example, it says "R 4.0.3". Does that mean we can only use version 4.0.3, or anything equal to or higher than this version? I suggest using ">=" or "=" to clarify, like glue (>= 1.3.0). See here.
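For reference, here is a small sketch of what the explicit notation looks like and how a minimum version can be checked from R. The "demo" package and its fields are made up for this example; only the glue (>= 1.3.0) and R 4.0.3 values come from the comment above.

```r
# A made-up DESCRIPTION fragment using explicit version constraints
desc_text <- c(
  "Package: demo",
  "Depends: R (>= 4.0.3)",
  "Imports: glue (>= 1.3.0)"
)

# read.dcf() parses DESCRIPTION-style fields into a character matrix
con <- textConnection(desc_text)
desc <- read.dcf(con)
close(con)

depends_field <- unname(desc[1, "Depends"])
imports_field <- unname(desc[1, "Imports"])

# ">=" means "this version or newer"; the running R version can be checked directly
r_ok <- getRversion() >= "4.0.3"
```

Writing the README list in the same form as the DESCRIPTION fields removes the ambiguity between an exact version and a minimum version.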

Please let me know what you think.

Mark

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 2.5

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Hi team, first of all congratulations on a very neat package. It definitely shows a lot of work put together.

Please find below a couple of comments:

- The README is very clear and concise. I would just suggest elaborating a little further on where the package sits in the R ecosystem. Especially given the sklearn-style wrapper functions in R, it might be worth explaining a little more. Also, I wonder about the codecov and test-coverage badges, and adding a couple of topics in your About section might help people find your package. These are just minor things; overall it's a great README with features very well explained and a nice plot to exemplify your show_clusters().
- The documentation is well put together. I would suggest explaining a little more about the general use of k-means clustering, as it might not be very understandable to beginners. It feels written for a more experienced audience, and that might limit the reach of this great package.
- In the Reference section of the vignette I can see all functions listed with their descriptions, which looks very neat. However, when you click on a function, the next page gives nice documentation for the selected function, but the title seems to include the function name plus the description, and the description is then repeated on the following lines. It is no big deal, but it looks a little noisy and could distract from the main content.
- Overall, I find the functions are written defensively, which is always nice. The testthat tests look fine, but maybe try to be more consistent across all four files. For example, two of them include the author and two do not, and the style also varies. I know this is challenging since each of you wrote one, but it's always worth trying to get all of them to the same level.
- I believe you did a great job handling your issues. A general recommendation will always be to communicate through your issues. Nonetheless, I think you handled them very well. I went through most of them and could see some clear comments.

So once again, really good job! As you can see, I could only suggest minor details and had to be picky, since your functions and code are really good and I was trying not to repeat the great review comments from the previous reviewer.

If you have any questions or would like further explanation, I'd be happy to help.

Daniel

UBC-MDS / software-review-2021