aadya940 / chainopy

ChainoPy: A Python Library for Discrete Time Markov Chain based stochastic analysis
https://chainopy.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License
12 stars · 1 fork

[JOSS Review] Software Paper #20

Closed: braniii closed this issue 3 weeks ago

braniii commented 1 month ago

Hi @aadya940,

this is the final part of my review and concerns the paper.

Here is a list of open tasks:

Kind regards, Daniel Nagel

braniii commented 1 month ago

And here are some additional notes on my comment about the statement of need:

While the package appears to be designed for improved speed compared to pyDTMC, I noticed that generating Markov chain Monte Carlo trajectories with chainopy seems to be relatively slow. In my experience, using alternative tools like deeptime-ml or msmhelper can provide a significant performance boost, sometimes improving speed by 1-3 orders of magnitude, depending on the number of steps required (see attachment). Is this feature important to your community, and if so, how many steps are typically needed?

To run this benchmark, you first need to install msmhelper:

python -m pip install msmhelper

benchmarks_simulate_mh.zip

aadya940 commented 1 month ago

Hi @braniii,

First off, I'd like to address the claim that chainopy is designed to be faster than packages that rely solely on NumPy + pure Python, such as PyDTMC. The reason is that Python loops tend to be slow due to the language's dynamic nature, and Markov chains involve many iterative algorithms (see this Stack Overflow answer).
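As a side note, the loop-bound nature of trajectory sampling can be seen in a minimal plain-NumPy sketch. This is an illustration, not chainopy's actual implementation, and the function name `sample_chain` is made up for this example; each step depends on the previous state, so the per-step interpreter overhead is exactly what Cython, Numba, or C removes:

```python
import numpy as np

def sample_chain(tmat, start, n_steps, rng):
    """Sample a state trajectory from transition matrix `tmat`."""
    # Precompute row-wise cumulative sums so each step is a single
    # binary search instead of a full call to rng.choice.
    cumsum = np.cumsum(tmat, axis=1)
    draws = rng.random(n_steps)          # pre-draw all uniforms at once
    traj = np.empty(n_steps + 1, dtype=np.intp)
    traj[0] = start
    state = start
    for i in range(n_steps):             # inherently sequential Python loop
        state = np.searchsorted(cumsum[state], draws[i])
        traj[i + 1] = state
    return traj

rng = np.random.default_rng(0)
tmat = np.array([[0.9, 0.1], [0.2, 0.8]])
traj = sample_chain(tmat, start=0, n_steps=1000, rng=rng)
```

Even with the cumulative sums and random draws hoisted out of the loop, the remaining per-step work is pure interpreter overhead, which is why compiled backends gain orders of magnitude here.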

I observed that the Markov sections of deeptime are written in pure C (https://github.com/deeptime-ml/deeptime), which is generally faster than Cython. While a library like MSM Helper might benefit from Numba's JIT compilation, whose access to runtime information can offer performance advantages over AOT compilers like Cython, speed is only one aspect of the challenge. Memory management is equally crucial. For instance, a bigram model such as a word-level Markov chain can produce a very large transition matrix: a corpus with just 100 unique words already consumes 78.125 KB (100 × 100 doubles = 80,000 bytes), and the matrix grows quadratically with the vocabulary size. Additionally, as the transition matrix becomes sparser, it's important to exploit that sparsity when saving the model, which I don't believe msmhelper does, while we do. We also plan to add support for memory-mapped arrays for matrices that don't fit into memory (see issue #11).
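To make the memory argument concrete, here is a hedged sketch of the dense-vs-sparse trade-off using scipy.sparse. This is not chainopy's actual saving code, just an illustration assuming SciPy is available:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

n_words = 100
dense = np.zeros((n_words, n_words))   # dense word-level transition matrix
dense[0, 1] = 0.4                      # only a few transitions are ever observed
dense[0, 2] = 0.6
dense[1, 0] = 1.0

# 100 x 100 doubles = 80,000 bytes, i.e. 78.125 KB
assert dense.nbytes == 80_000

sparse = sp.csr_matrix(dense)          # keep only the nonzero entries
path = os.path.join(tempfile.mkdtemp(), "tmat.npz")
sp.save_npz(path, sparse)              # compact on-disk representation
restored = sp.load_npz(path)
```

With only three nonzero entries, the CSR form stores a handful of values and indices instead of 10,000 doubles, and the gap widens quadratically as the vocabulary grows.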

Moreover, I’ve developed chainopy with specific use cases in mind, such as text modeling and stock prediction. While MCMC applications are undoubtedly significant, there are already highly optimized packages like PyMC, which are built on tensor libraries and graph compilers like PyTensor and Aesara. Therefore, a more appropriate comparison would be against PyMC.

Thank you for your attention to these points. I hope I've addressed your queries with utmost respect to other libraries and their authors.

aadya940 commented 1 month ago

The comparison with pyDTMC pertains strictly to the core aspects of Markov Chains. The primary goal of chainopy is to expand upon these functionalities, enabling the implementation of advanced workflows such as Markov Switching Models, Markov Chain Neural Networks, and Hidden Markov Models (HMMs). These capabilities set chainopy apart from pyDTMC, offering a broader range of applications and increased flexibility for complex modeling tasks.

aadya940 commented 1 month ago

@braniii I've mentioned a few packages in the text modelling domain that use pure Python and NumPy.

aadya940 commented 1 month ago

@braniii Also, I added a few lines in the Statement of Need. I hope everything looks good to you now :))

braniii commented 1 month ago

> Hi @braniii,
>
> First off, I'd like to address the claim that chainopy is designed to be faster than packages that rely solely on NumPy + pure Python, such as PyDTMC. The reason is that Python loops tend to be slow due to the language's dynamic nature, and Markov chains involve many iterative algorithms (see this Stack Overflow answer).
>
> I observed that the Markov sections of deeptime are written in pure C (https://github.com/deeptime-ml/deeptime), which is generally faster than Cython. While a library like MSM Helper might benefit from Numba's JIT compilation, whose access to runtime information can offer performance advantages over AOT compilers like Cython, speed is only one aspect of the challenge. Memory management is equally crucial. For instance, a bigram model such as a word-level Markov chain can produce a very large transition matrix: a corpus with just 100 unique words already consumes 78.125 KB (100 × 100 doubles = 80,000 bytes), and the matrix grows quadratically with the vocabulary size. Additionally, as the transition matrix becomes sparser, it's important to exploit that sparsity when saving the model, which I don't believe msmhelper does, while we do. We also plan to add support for memory-mapped arrays for matrices that don't fit into memory (see issue #11).
>
> Moreover, I've developed chainopy with specific use cases in mind, such as text modeling and stock prediction. While MCMC applications are undoubtedly significant, there are already highly optimized packages like PyMC, which are built on tensor libraries and graph compilers like PyTensor and Aesara. Therefore, a more appropriate comparison would be against PyMC.
>
> Thank you for your attention to these points. I hope I've addressed your queries with utmost respect to other libraries and their authors.

Thank you for this detailed answer. It is true that I am not aware of any package that saves the matrix in a sparse format. However, since estimating the transition matrix is fast (~µs to ~ms), I guess it is typically estimated on the fly. And true, PyMC surely outperforms all other mentioned libraries, at least on large models, thanks to its natively CUDA-optimized JAX backend.
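For context, the on-the-fly estimate is cheap because counting transitions in a discrete trajectory is a single vectorized pass. A minimal sketch, where `estimate_tmat` is a hypothetical helper and not taken from any of the libraries discussed:

```python
import numpy as np

def estimate_tmat(traj, n_states):
    """Estimate a row-stochastic transition matrix from a state trajectory."""
    counts = np.zeros((n_states, n_states))
    # Count i -> j transitions for every consecutive pair in the trajectory.
    np.add.at(counts, (traj[:-1], traj[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # leave unvisited states as zero rows
    return counts / row_sums             # row-normalize to probabilities

traj = np.array([0, 1, 1, 0, 1, 1, 1, 0])
tmat = estimate_tmat(traj, n_states=2)
```

Since this is one pass over the trajectory with no Python-level loop, µs-to-ms estimation times for typical trajectory lengths are plausible.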

braniii commented 1 month ago

> @braniii Also, I added a few lines in the Statement of Need. I hope everything looks good to you now :))

These are really helping to underline the statement of chainopy :+1:

ps: Please add the missing full stop to the last sentence.