bdice opened this issue 2 months ago
Hi @bdice, thank you for taking the time to provide such detailed feedback! And, looking at PR #4, for providing some solutions!
I'm traveling this week for a conference in Italy, so I'm a bit time-limited, but I should be able to revise everything point-by-point over the next two weeks.
The GitHub Codespaces took a long time to open (>5 minutes). I would try consolidating the conda install commands. The nim packages on conda-forge depend on conda packages of gcc so it is possible to avoid the 'apt install gcc' steps entirely.
Thanks for catching that! I accidentally limited the pre-building scope to the Eastern US. Now, it should start in 30ish seconds worldwide.
apt-get install nim did not work on my Ubuntu system. Perhaps the "universe" channel is not enabled by default? I see the package here: https://packages.ubuntu.com/noble/nim but it is version 1.6, while this software claims to require Nim 2.0 in nimcso.nimble. Please verify that sufficiently recent versions of nim are available from the recommended installation paths.
Indeed, we require `2.0` in the nimble file, but that's more about what we provide support for (and recommend performance-wise). It should compile fine under `1.6.x` and work as expected.
I looked at the "native Python" benchmark first. It appears that the NumPy benchmark is not measuring the same amount of work as the native Python benchmark. For example, the construction of elMatrix from the input data should be a part of the timing for the NumPy benchmark (and similarly for the nimCSO benchmark). It's okay to compare algorithm runtime that excludes the CSV parsing (such as the cost to create elementalList in the native Python benchmark) but it is not a fair comparison to pre-process that parsed data by converting into a matrix for NumPy or bit-packed array. Make sure that all benchmarks are measuring as close to the same thing as possible when making relative claims about performance. I understand it is important to amortize that cost of creating a bit-packed array for the input data when computing across many element lists, but it's a bit harder to compare to the native Python solution. I suggest trying to find a middle ground, or at least giving more description of what is being measured.
You are right about the pre-processing being excluded and that it does skew the benchmark result, but I believe that choice was sound in terms of method comparison.
The benchmark script is just a quick run over 10,000 checks, i.e., orders of magnitude fewer than in realistic problems, for which the `elMatrix` calculation time would be negligible. Even for the 10,000 checks, adding the `elMatrix` calculation to the `pythonNumpPy.py` benchmark (an extra `_ = np.array([elList2vec(comp) for comp in elementalList], dtype=np.int8)`) costs on the order of 1% (CPU time per comparison: 19.2 ns vs 19.4 ns).
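For context, here is a minimal sketch of what folding the conversion into the timed region looks like; the data and the `elList2vec` implementation below are hypothetical stand-ins for what the real benchmark builds from the CSV:

```python
import time
import numpy as np

# Hypothetical stand-ins for the benchmark's parsed dataset and helper.
elementOrder = ["Fe", "Cr", "Ni", "Co", "Al", "Ti"]
elementalList = [{"Fe", "Cr", "Ni"}, {"Co", "Al"}, {"Fe", "Ti", "Al"}] * 3000

def elList2vec(comp):
    # Encode one composition as a presence vector in a fixed element order.
    return [1 if el in comp else 0 for el in elementOrder]

start = time.perf_counter()
# Fold the matrix construction into the timed region...
elMatrix = np.array([elList2vec(comp) for comp in elementalList], dtype=np.int8)
# ...alongside the work being benchmarked, e.g. counting how many data
# rows a candidate excluded-element set would prevent.
excludedVec = np.array(elList2vec({"Al", "Ti"}), dtype=np.int8)
prevented = int(np.count_nonzero(elMatrix @ excludedVec))
print(f"prevented {prevented} rows in {time.perf_counter() - start:.4f} s")
```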
The statement of need describes applications in economics and medicine as well as materials. I understand the materials science application (which is closer to my field of expertise, and is substantiated by the paper) but the economics and medicine use cases need citations.
Thank you for the suggestion!
I'm new to nim and had some confusion while installing. It seems like it is best practice to use nimble install --depsOnly rather than listing the dependencies directly in the README. Also, I tried nimble install and nimble develop but was unsure how to get the nimCSO executable onto my PATH. The compile command nim c -r -f -d:release src/nimcso produces an executable in src/nimCSO.out but that doesn't seem like the proper "final destination." I am thinking of tools like pip install that can place entry points or executables directly onto your PATH.
In general, `nimble` will give you a similar experience to `pip` and allow you to overlay `nim` tools around `nimCSO` without any "manual labor", since `nimCSO` is indexed by it (a `PyPI` equivalent) and lets you just install without any cloning or manual compiling. `nimCSO` (and the installation `README`) undergoes such gymnastics and places the binary in the local `src` rather than some more general location because it is meant to be compiled on a per-task basis, taking advantage of optimizations enabled by knowing task specifics. Thus, having it installed would (I think) confuse users and leave unnecessary files behind after task completion.
Thanks for suggesting `nimble install --depsOnly`. When there are only 1-3 top-level dependencies, I generally prefer to clearly state them in the installation instructions and explain what they are, but I can see the appeal of just using `--depsOnly`. I will add it as an alternative to the `README`.
- `README` install instructions
- `nimble install --depsOnly` option to `README`
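For reference, the per-task flow pieced together from the commands in this thread (run inside a clone of the repository) looks roughly like this:

```shell
nimble install --depsOnly          # fetch dependencies declared in nimcso.nimble
nim c -r -f -d:release src/nimcso  # compile (and run) the task-tuned binary
```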
The repository would benefit from a CONTRIBUTING.md. It does not appear that the repository has had many issues or pull requests. It would be good to explicitly instruct users to file issues and pull requests.
Thank you for the suggestion!
- `CONTRIBUTING.md`
I strongly recommend configuring a tool like pre-commit to automatically trim trailing whitespace and format the code. Having enforceable style conventions, even a minimal set, smooths the process of contributing to an open-source project. If you'd like help here, I can explain more and file a PR.
Thank you for the suggestion! I will add a few hooks that should be helpful here (and do the same for some other projects!).
The style conventions in the `nim` language are a much trickier subject than in most languages. Nim is explicitly designed not to have code style conventions, even at the basic level of naming, and to make programmers read code with how it will be parsed into the AST in mind. The result is that collaborative projects use camelCase in one file to define a function and then snake_case in another one to call it. There are some guidelines for compiler contributions I can add as suggestions, but even the largest projects, like Arraymancer, drink the Kool-Aid.
- pre-commit hooks: `check-yaml`, `check-case-conflict`, `check-illegal-windows-names`, `end-of-file-fixer`, and `trailing-whitespace`
- `nim` guidelines for compiler contributions as a suggestion in `CONTRIBUTING.md`
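For reference, those hooks map onto a minimal `.pre-commit-config.yaml` along these lines (the `rev` pin here is illustrative; use the current release):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # illustrative pin
    hooks:
      - id: check-yaml
      - id: check-case-conflict
      - id: check-illegal-windows-names
      - id: end-of-file-fixer
      - id: trailing-whitespace
```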
I think the paper would benefit from a toy example, explaining what it does. It took me some time to understand precisely what is meant by "selecting components" and phrases like "least preventing". Perhaps a toy example could look something like this?
You are 100% right. I was probably too focused on my problem and should have made it broader.
- Toy example explaining `nimCSO` and problems associated with interdependency

@amkrajewski Thanks for the preliminary feedback -- I finished my review. I marked where the second half of my review begins above. Overall, the paper is in pretty good shape, and it should be straightforward to get it to a state that I can approve.
All your responses above sound good.
Also, I marked PR #4 as ready for review. Ping me when you've had a chance to look at it and apply the suggestions above, and I will work on a second round of review.
Hi @amkrajewski and @bdice, is this issue still ongoing, or can it be resolved?
I am still working on the last remaining point, i.e., adding a general-audience example of `nimCSO` use. I initially planned to use the suggested "pizza shop" example, which is excellent, but I decided to go with a more general "ingredients in recipes" example due to the surprisingly extensive literature and data on the topic.
It sent me down a bit of a research rabbit hole, but I should be able to add it to the manuscript by this Monday. Additionally, after the revisions are complete, I plan to add a new tutorial using said ingredient data.
Thank you for the update! FYI, I'll be away for the next two weeks; apologies in advance if my replies are slow.
@amkrajewski Sounds good to me. Ping me when you're finished making changes and I'll update this issue and my review checklist!
Hi @bdice, thanks again for taking the time and effort to provide valuable feedback! I believe the code and paper adjustments are ready for you to evaluate.
On a side note, as mentioned in my last comment, I am working on an example showing `nimCSO` deployed over cooking ingredients and spices, but I think it can come after the journal submission is finalized.
Hi @bdice, when you evaluate the latest changes from @amkrajewski, please update your checklist in https://github.com/openjournals/joss-reviews/issues/6731#issuecomment-2183177055 accordingly, so that we can keep track of the progress. Many thanks!
Hi @amkrajewski, I am one of the reviewers for your paper: https://github.com/openjournals/joss-reviews/issues/6731
I am copying a portion of the reviewer checklist here, with some notes that I have for improvement. I've checked most of the boxes on the original review, so I will leave those out of this list.
I am also opening a pull request with some suggested changes: https://github.com/amkrajewski/nimCSO/pull/4
I will add some comments on that pull request and link to the corresponding notes in this issue.
Review checklist for @bdice
Functionality
[ ] Installation: Does installation proceed as outlined in the documentation?
`apt-get install nim` did not work on my Ubuntu system. Perhaps the "universe" channel is not enabled by default? I see the package here: https://packages.ubuntu.com/noble/nim but it is version 1.6, while this software claims to require Nim 2.0 in `nimcso.nimble`. Please verify that sufficiently recent versions of nim are available from the recommended installation paths.

[ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
I looked at the "native Python" benchmark first. It appears that the NumPy benchmark is not measuring the same amount of work as the native Python benchmark. For example, the construction of `elMatrix` from the input data should be a part of the timing for the NumPy benchmark (and similarly for the nimCSO benchmark). It's okay to compare algorithm runtime that excludes the CSV parsing (such as the cost to create `elementalList` in the native Python benchmark) but it is not a fair comparison to pre-process that parsed data by converting into a matrix for NumPy or bit-packed array. Make sure that all benchmarks are measuring as close to the same thing as possible when making relative claims about performance. I understand it is important to amortize that cost of creating a bit-packed array for the input data when computing across many element lists, but it's a bit harder to compare to the native Python solution. I suggest trying to find a middle ground, or at least giving more description of what is being measured.

Documentation
The statement of need describes applications in economics and medicine as well as materials. I understand the materials science application (which is closer to my field of expertise, and is substantiated by the paper) but the economics and medicine use cases need citations.
I'm new to `nim` and had some confusion while installing. It seems like it is best practice to use `nimble install --depsOnly` rather than listing the dependencies directly in the README. Also, I tried `nimble install` and `nimble develop` but was unsure how to get the `nimCSO` executable onto my PATH. The compile command `nim c -r -f -d:release src/nimcso` produces an executable in `src/nimCSO.out` but that doesn't seem like the proper "final destination." I am thinking of tools like `pip install` that can place entry points or executables directly onto your PATH.

The repository would benefit from a CONTRIBUTING.md. It does not appear that the repository has had many issues or pull requests. It would be good to explicitly instruct users to file issues and pull requests.
I strongly recommend configuring a tool like pre-commit to automatically trim trailing whitespace and format the code. Having enforceable style conventions, even a minimal set, smooths the process of contributing to an open-source project. If you'd like help here, I can explain more and file a PR.
Software paper
I think the paper would benefit from a toy example, explaining what it does. It took me some time to understand precisely what is meant by "selecting components" and phrases like "least preventing". Perhaps a toy example could look something like this?
Obviously this is very contrived, but I think the problem statement and motivation for this work could be clearer, and connect the domain field (materials science / compositionally complex materials) to the broadly-applicable selection algorithms implemented in nimCSO.
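As an illustration of the kind of toy problem being suggested, a brute-force sketch in plain Python with made-up data (not nimCSO's actual algorithm) could read:

```python
from itertools import combinations

# Made-up toy dataset: each row is the set of components one data point
# (an alloy, a recipe, ...) requires.
dataset = [
    {"Fe", "Cr", "Ni"},
    {"Fe", "Cr", "Ni", "Al"},
    {"Fe", "Co", "Ni"},
    {"Cr", "Ni", "Ti"},
    {"Fe", "Cr"},
]
components = sorted(set().union(*dataset))
keep = 3  # how many components we can afford to keep in the model

def prevented(subset):
    # A row is "prevented" if it needs a component outside the kept subset.
    return sum(not row <= set(subset) for row in dataset)

best = min(combinations(components, keep), key=prevented)
print(f"Keep {best}: {prevented(best)} of {len(dataset)} rows prevented")
```

Selecting the 3 "least preventing" components here keeps {Fe, Cr, Ni}, losing only the rows that also need Al, Co, or Ti.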
Second part of the review begins here
I skimmed the paper on a second read but didn't see any comparisons to other software. How do people solve this problem today? Is it all custom scripts or are there other libraries out there? (e.g. in other programming languages, with different search algorithms, etc.) Especially with the mention of other fields where this is useful (economics, medicine), I think prior art probably exists.
There are a few typos that I have addressed in #4. Generally the writing quality is good.
Some things noted above would be better with a citation (describing examples in economics/medicine, and adding references to other software in this space). Citation style looks fine. One note: the citation for "Genetic algorithms in search, optimization and machine learning" (https://doi.org/10.5860/choice.27-0936) doesn't look like a useful link. It goes to a login page that has no information about the title. Can you verify this?