amkrajewski / nimCSO

nim Composition Space Optimization is a high-performance tool leveraging metaprogramming to implement several methods for selecting components (data dimensions) in compositional datasets, so as to optimize the data availability and density for applications such as machine learning.
https://nimcso.phaseslab.org
MIT License

JOSS Review Comments #3

Open bdice opened 2 months ago

bdice commented 2 months ago

Hi @amkrajewski, I am one of the reviewers for your paper: https://github.com/openjournals/joss-reviews/issues/6731

I am copying a portion of the reviewer checklist here, with some notes that I have for improvement. I've checked most of the boxes on the original review, so I will leave those out of this list.

I am also opening a pull request with some suggested changes: https://github.com/amkrajewski/nimCSO/pull/4

I will add some comments on that pull request and link to the corresponding notes in this issue.

Review checklist for @bdice

Functionality

I looked at the "native Python" benchmark first. It appears that the NumPy benchmark is not measuring the same amount of work as the native Python benchmark. For example, the construction of elMatrix from the input data should be a part of the timing for the NumPy benchmark (and similarly for the nimCSO benchmark). It's okay to compare algorithm runtime that excludes the CSV parsing (such as the cost to create elementalList in the native Python benchmark) but it is not a fair comparison to pre-process that parsed data by converting into a matrix for NumPy or bit-packed array. Make sure that all benchmarks are measuring as close to the same thing as possible when making relative claims about performance. I understand it is important to amortize that cost of creating a bit-packed array for the input data when computing across many element lists, but it's a bit harder to compare to the native Python solution. I suggest trying to find a middle ground, or at least giving more description of what is being measured.

Documentation

The statement of need describes applications in economics and medicine as well as materials. I understand the materials science application (which is closer to my field of expertise, and is substantiated by the paper) but the economics and medicine use cases need citations.

I'm new to nim and had some confusion while installing. It seems like it is best practice to use nimble install --depsOnly rather than listing the dependencies directly in the README. Also, I tried nimble install and nimble develop but was unsure how to get the nimCSO executable onto my PATH. The compile command nim c -r -f -d:release src/nimcso produces an executable in src/nimCSO.out but that doesn't seem like the proper "final destination." I am thinking of tools like pip install that can place entry points or executables directly onto your PATH.

The repository would benefit from a CONTRIBUTING.md. It does not appear that the repository has had many issues or pull requests. It would be good to explicitly instruct users to file issues and pull requests.

I strongly recommend configuring a tool like pre-commit to automatically trim trailing whitespace and format the code. Having enforceable style conventions, even a minimal set, smooths the process of contributing to an open-source project. If you'd like help here, I can explain more and file a PR.

Software paper

I think the paper would benefit from a toy example, explaining what it does. It took me some time to understand precisely what is meant by "selecting components" and phrases like "least preventing". Perhaps a toy example could look something like this?

You run a pizza store and want to simplify the menu. Customers can choose from over 30 toppings. You have sales data about the toppings on every pizza purchase from the last year. Find the best combination of 15 toppings that will satisfy the most orders. How many more orders could be satisfied if you had made a menu with the best 16 toppings? The best 14 toppings?

Obviously this is very contrived, but I think the problem statement and motivation for this work could be clearer, and connect the domain field (materials science / compositionally complex materials) to the broadly-applicable selection algorithms implemented in nimCSO.
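For concreteness, the toy problem above can be sketched as a brute-force search in a few lines of Python. Everything here (topping names, sales data, the search itself) is illustrative only; nimCSO's contribution is doing this kind of selection efficiently at realistic scale:

```python
from itertools import combinations

# Hypothetical sales data: each order is the set of toppings on one pizza.
orders = [
    {"pepperoni"}, {"pepperoni", "mushroom"}, {"onion", "pepper"},
    {"mushroom", "onion"}, {"pepperoni", "onion"}, {"pepper"},
]
toppings = sorted(set().union(*orders))

def satisfied(menu: set) -> int:
    """Number of orders whose toppings are all available on the menu."""
    return sum(order <= menu for order in orders)

def best_menu(size: int):
    """Exhaustively search all menus of a given size (fine at toy scale)."""
    return max(combinations(toppings, size), key=lambda m: satisfied(set(m)))

for k in (2, 3):
    menu = best_menu(k)
    print(k, menu, satisfied(set(menu)))
```

Comparing `satisfied(set(best_menu(k)))` across values of k answers the "how many more orders with 16 vs. 15 toppings?" question directly.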


Second part of the review begins here

I skimmed the paper on a second read but didn't see any comparisons to other software. How do people solve this problem today? Is it all custom scripts or are there other libraries out there? (e.g. in other programming languages, with different search algorithms, etc.) Especially with the mention of other fields where this is useful (economics, medicine), I think prior art probably exists.

There are a few typos that I have addressed in #4. Generally the writing quality is good.

Some things noted above would be better with a citation (describing examples in economics/medicine, and adding references to other software in this space). Citation style looks fine. One note: the citation for "Genetic algorithms in search, optimization and machine learning" (https://doi.org/10.5860/choice.27-0936) doesn't look like a useful link. It goes to a login page that has no information about the title. Can you verify this?

amkrajewski commented 2 months ago

Hi @bdice, thank you for taking the time to provide such detailed feedback! And, looking at PR #4, for providing some solutions!

I'm traveling this week for a conference in Italy, so I'm a bit time-limited, but I should be able to revise everything point-by-point over the next two weeks.

amkrajewski commented 2 months ago

The GitHub Codespaces took a long time to open (>5 minutes). I would try consolidating the conda install commands. The nim packages on conda-forge depend on conda packages of gcc so it is possible to avoid the 'apt install gcc' steps entirely.

Thanks for catching that! I accidentally limited the pre-building scope to the Eastern US. Now, it should start in 30ish seconds worldwide.

amkrajewski commented 2 months ago

apt-get install nim did not work on my Ubuntu system. Perhaps the "universe" channel is not enabled by default? I see the package here: https://packages.ubuntu.com/noble/nim but it is version 1.6, while this software claims to require Nim 2.0 in nimcso.nimble. Please verify that sufficiently recent versions of nim are available from the recommended installation paths.

Indeed, we require 2.0 in the nimble file, but that's more about what we provide support for (and recommend performance-wise). It should compile fine under 1.6.x and work as expected.

amkrajewski commented 2 months ago

I looked at the "native Python" benchmark first. It appears that the NumPy benchmark is not measuring the same amount of work as the native Python benchmark. For example, the construction of elMatrix from the input data should be a part of the timing for the NumPy benchmark (and similarly for the nimCSO benchmark). It's okay to compare algorithm runtime that excludes the CSV parsing (such as the cost to create elementalList in the native Python benchmark) but it is not a fair comparison to pre-process that parsed data by converting into a matrix for NumPy or bit-packed array. Make sure that all benchmarks are measuring as close to the same thing as possible when making relative claims about performance. I understand it is important to amortize that cost of creating a bit-packed array for the input data when computing across many element lists, but it's a bit harder to compare to the native Python solution. I suggest trying to find a middle ground, or at least giving more description of what is being measured.

You are right about the pre-processing being excluded and that it does skew the benchmark result, but I believe that choice was sound in terms of method comparison.

The benchmark script is just a quick run over 10,000 checks, i.e., a number orders of magnitude lower than in realistic problems, where the elMatrix calculation time would be negligible. Even for the 10,000 checks, adding the elMatrix calculation to the pythonNumpPy.py benchmark (an extra _ = np.array([elList2vec(comp) for comp in elementalList], dtype=np.int8)) costs on the order of 1% (CPU time per comparison: 19.2 ns vs. 19.4 ns).
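To make the two measurement choices concrete, the difference amounts to where the conversion step sits relative to the timer. A minimal sketch (elList2vec, the element set, and the workload are stand-ins for the real benchmark script, not its actual code):

```python
import time
import numpy as np

ELEMENTS = ["Fe", "Cr", "Ni", "Co", "Al"]

def elList2vec(comp):
    # Stand-in: convert an element list into a 0/1 presence vector.
    return [int(el in comp) for el in ELEMENTS]

elementalList = [["Fe", "Cr"], ["Ni", "Co"], ["Fe", "Ni", "Al"]] * 1000

# Variant 1: pre-processing excluded from the timed region
# (amortized, as in the original benchmark).
elMatrix = np.array([elList2vec(c) for c in elementalList], dtype=np.int8)
t0 = time.perf_counter()
mask = np.array(elList2vec(["Fe", "Cr", "Ni"]), dtype=np.int8)
presentCount = int((elMatrix @ mask > 0).sum())  # stand-in workload
tExcluded = time.perf_counter() - t0

# Variant 2: pre-processing included in the timed region
# (the apples-to-apples comparison with native Python).
t0 = time.perf_counter()
elMatrix2 = np.array([elList2vec(c) for c in elementalList], dtype=np.int8)
mask = np.array(elList2vec(["Fe", "Cr", "Ni"]), dtype=np.int8)
presentCount2 = int((elMatrix2 @ mask > 0).sum())
tIncluded = time.perf_counter() - t0

assert presentCount == presentCount2  # same answer either way
```

Over many element-list checks the one-time matrix construction in variant 2 is amortized away, which is the point made above; describing that explicitly in the benchmark docs resolves the ambiguity either way.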

amkrajewski commented 2 months ago

The statement of need describes applications in economics and medicine as well as materials. I understand the materials science application (which is closer to my field of expertise, and is substantiated by the paper) but the economics and medicine use cases need citations.

Thank you for the suggestion!

amkrajewski commented 2 months ago

I'm new to nim and had some confusion while installing. It seems like it is best practice to use nimble install --depsOnly rather than listing the dependencies directly in the README. Also, I tried nimble install and nimble develop but was unsure how to get the nimCSO executable onto my PATH. The compile command nim c -r -f -d:release src/nimcso produces an executable in src/nimCSO.out but that doesn't seem like the proper "final destination." I am thinking of tools like pip install that can place entry points or executables directly onto your PATH.

In general, nimble will give you an experience similar to pip and let you layer Nim tools around nimCSO without any "manual labor": since nimCSO is indexed in nimble's package registry (the PyPI equivalent), you can install it without any cloning or manual compiling.

nimCSO (and the installation README) undergoes such gymnastics and places the binary in the local src rather than in a more general location because it is meant to be compiled on a per-task basis, taking advantage of optimizations enabled by knowing the task specifics. Thus, having it installed globally would (I think) confuse users and leave unnecessary files behind after task completion.

Thanks for suggesting nimble install --depsOnly. When there are only 1-3 top-level dependencies, I generally prefer to clearly state them in the installation instructions and explain what they are, but I can see the appeal of just using --depsOnly. I will add them as an alternative to the README.
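For readers following along, the two routes discussed above might look like this (the repository URL matches this project; the compile flags are the ones from the README):

```shell
# Route 1: install from nimble's package index (pip-like experience)
nimble install nimcso

# Route 2: clone, install only the dependencies, and compile per-task
git clone https://github.com/amkrajewski/nimCSO
cd nimCSO
nimble install --depsOnly
nim c -r -f -d:release src/nimcso
```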

amkrajewski commented 2 months ago

The repository would benefit from a CONTRIBUTING.md. It does not appear that the repository has had many issues or pull requests. It would be good to explicitly instruct users to file issues and pull requests.

Thank you for the suggestion!

I strongly recommend configuring a tool like pre-commit to automatically trim trailing whitespace and format the code. Having enforceable style conventions, even a minimal set, smooths the process of contributing to an open-source project. If you'd like help here, I can explain more and file a PR.

Thank you for the suggestion! I will add a few hooks that should be helpful here (and do the same for some other projects!).

Style conventions in the Nim language are a much trickier subject than in most languages. Nim is explicitly designed without enforced naming conventions, even at the basic level (identifiers are partially case- and underscore-insensitive), and programmers are expected to read code with how it will be parsed into the AST in mind. The result is that collaborative projects use camelCase in one file to define a function and then snake_case in another one to call it. There are some guidelines for compiler contributions I can add as suggestions, but even the largest projects, like Arraymancer, drink the Kool-Aid.
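For reference, a minimal `.pre-commit-config.yaml` covering just the whitespace hooks mentioned above might look like this (the hook ids come from the standard pre-commit-hooks repository; the `rev` tag shown is illustrative and should be pinned to the latest release):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```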

amkrajewski commented 2 months ago

I think the paper would benefit from a toy example, explaining what it does. It took me some time to understand precisely what is meant by "selecting components" and phrases like "least preventing". Perhaps a toy example could look something like this?

You are 100% right. I was probably too focused on my problem and should have made it broader.

bdice commented 2 months ago

@amkrajewski Thanks for the preliminary feedback -- I finished my review. I marked where the second half of my review begins above. Overall the paper is in pretty good shape and it should be straightforward to get this to a state that I can approve.

All your responses above sound good.

Also, I marked my PR #4 as ready for review. Ping me when you've had a chance to look at it and apply the suggestions above, and I will start a second round of review.

RMeli commented 3 weeks ago

Hi @amkrajewski and @bdice, is this issue still ongoing, or can it be resolved?

amkrajewski commented 3 weeks ago

I am still working on the last remaining point, i.e., adding a general-audience example of nimCSO use. I initially planned to use the suggested "pizza shop" example, which is excellent, but I decided to go with a more general "ingredients in recipes" example due to the surprisingly extensive literature and data on the topic.

It sent me into a bit of a research rabbit hole, but I should be able to add it to the manuscript by this Monday. Additionally, after the revisions are complete, I plan to add a new tutorial using said ingredient data.

RMeli commented 2 weeks ago

Thank you for the update! FYI, I'll be away for the next two weeks, so apologies in advance if my replies are slow.

bdice commented 2 weeks ago

@amkrajewski Sounds good to me. Ping me when you're finished making changes and I'll update this issue and my review checklist!

amkrajewski commented 2 weeks ago

Hi @bdice, thanks again for taking the time and effort to provide valuable feedback! I believe the code and paper adjustments are ready for you to evaluate.

On a side note, as mentioned in my last comment, I am working on an example showing nimCSO deployed over cooking ingredients and spices, but I think it can come after the journal submission is finalized.

RMeli commented 2 days ago

Hi @bdice, when you evaluate the latest changes from @amkrajewski, please update your checklist in https://github.com/openjournals/joss-reviews/issues/6731#issuecomment-2183177055 accordingly, so that we can keep track of the progress. Many thanks!