aadya940 / chainopy

ChainoPy: A Python Library for Discrete Time Markov Chain based stochastic analysis
https://chainopy.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License
12 stars · 1 fork

[JOSS Review] Software Paper #20

Closed: braniii closed this issue 3 weeks ago

braniii commented 1 month ago

Hi @aadya940,

this is the final part of my review and concerns the paper.

Here is a list of open tasks:

Kind regards, Daniel Nagel

braniii commented 1 month ago

And here are some additional notes on my comment about the statement of need:

While the package appears to be designed for improved speed compared to pyDTMC, I noticed that generating Markov chain Monte Carlo trajectories with chainopy seems to be relatively slow. In my experience, using alternative tools like deeptime-ml or msmhelper can provide a significant performance boost, sometimes improving speed by 1-3 orders of magnitude, depending on the number of steps required (see attachment). Is this feature important to your community, and if so, how many steps are typically needed?

To run this benchmark, you first need to install msmhelper:

python -m pip install msmhelper

benchmarks_simulate_mh.zip

aadya940 commented 1 month ago

Hi @braniii,

First off, I'd like to address the claim that chainopy is designed to be faster than packages that rely solely on NumPy + pure Python, such as PyDTMC. The reason is that Python loops tend to be slow due to the language's dynamic nature, and Markov chains involve many iterative algorithms (see this Stack Overflow answer).
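As a side note, the loop-bound nature of trajectory sampling can be seen in a minimal plain-NumPy sketch. This is an illustration, not chainopy's actual implementation, and the function name `sample_chain` is made up for this example; each step depends on the previous state, so the per-step interpreter overhead is exactly what Cython, Numba, or C removes:

```python
import numpy as np

def sample_chain(tmat, start, n_steps, rng):
    """Sample a state trajectory from transition matrix `tmat`."""
    # Precompute row-wise cumulative sums so each step is a single
    # binary search instead of a full call to rng.choice.
    cumsum = np.cumsum(tmat, axis=1)
    draws = rng.random(n_steps)          # pre-draw all uniforms at once
    traj = np.empty(n_steps + 1, dtype=np.intp)
    traj[0] = start
    state = start
    for i in range(n_steps):             # inherently sequential Python loop
        state = np.searchsorted(cumsum[state], draws[i])
        traj[i + 1] = state
    return traj

rng = np.random.default_rng(0)
tmat = np.array([[0.9, 0.1], [0.2, 0.8]])
traj = sample_chain(tmat, start=0, n_steps=1000, rng=rng)
```

Even with the cumulative sums and random draws hoisted out of the loop, the remaining per-step work is pure interpreter overhead, which is why compiled backends gain orders of magnitude here.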

I observed that the Markov sections of deeptime are written in pure C (https://github.com/deeptime-ml/deeptime), which is generally faster than Cython. While a library like MSM Helper might benefit from Numba's JIT compilation, whose access to runtime information can offer performance advantages over AOT compilers like Cython, speed is only one aspect of the challenge. Memory management is equally crucial. For instance, a bigram model such as a word-level Markov chain can produce a very large transition matrix: a corpus with just 100 unique words already consumes 78.125 KB (100 × 100 doubles = 80,000 bytes), and the matrix grows quadratically with the vocabulary size. Additionally, as the transition matrix becomes sparser, it's important to exploit that sparsity when saving the model, which I don't believe msmhelper does, while we do. We also plan to add support for memory-mapped arrays for matrices that don't fit into memory (see issue #11).
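To make the memory argument concrete, here is a hedged sketch of the dense-vs-sparse trade-off using scipy.sparse. This is not chainopy's actual saving code, just an illustration assuming SciPy is available:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

n_words = 100
dense = np.zeros((n_words, n_words))   # dense word-level transition matrix
dense[0, 1] = 0.4                      # only a few transitions are ever observed
dense[0, 2] = 0.6
dense[1, 0] = 1.0

# 100 x 100 doubles = 80,000 bytes, i.e. 78.125 KB
assert dense.nbytes == 80_000

sparse = sp.csr_matrix(dense)          # keep only the nonzero entries
path = os.path.join(tempfile.mkdtemp(), "tmat.npz")
sp.save_npz(path, sparse)              # compact on-disk representation
restored = sp.load_npz(path)
```

With only three nonzero entries, the CSR form stores a handful of values and indices instead of 10,000 doubles, and the gap widens quadratically as the vocabulary grows.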

Moreover, I’ve developed chainopy with specific use cases in mind, such as text modeling and stock prediction. While MCMC applications are undoubtedly significant, there are already highly optimized packages like PyMC, which are built on tensor libraries and graph compilers like PyTensor and Aesara. Therefore, a more appropriate comparison would be against PyMC.

Thank you for your attention to these points. I hope I've addressed your queries with utmost respect to other libraries and their authors.

aadya940 commented 1 month ago

The comparison with pyDTMC pertains strictly to the core aspects of Markov Chains. The primary goal of chainopy is to expand upon these functionalities, enabling the implementation of advanced workflows such as Markov Switching Models, Markov Chain Neural Networks, and Hidden Markov Models (HMMs). These capabilities set chainopy apart from pyDTMC, offering a broader range of applications and increased flexibility for complex modeling tasks.

aadya940 commented 1 month ago

@braniii I've mentioned a few packages in the text modelling domain that use pure Python and NumPy.

aadya940 commented 1 month ago

@braniii Also, I added a few lines in the Statement of Need. I hope everything looks good to you now :))

braniii commented 1 month ago

> Hi @braniii,
>
> First off, I'd like to address the claim that chainopy is designed to be faster than packages that rely solely on NumPy + pure Python, such as PyDTMC. The reason is that Python loops tend to be slow due to the language's dynamic nature, and Markov chains involve many iterative algorithms (see this Stack Overflow answer).
>
> I observed that the Markov sections of deeptime are written in pure C (https://github.com/deeptime-ml/deeptime), which is generally faster than Cython. While a library like MSM Helper might benefit from Numba's JIT compilation, whose access to runtime information can offer performance advantages over AOT compilers like Cython, speed is only one aspect of the challenge. Memory management is equally crucial. For instance, a bigram model such as a word-level Markov chain can produce a very large transition matrix: a corpus with just 100 unique words already consumes 78.125 KB (100 × 100 doubles = 80,000 bytes), and the matrix grows quadratically with the vocabulary size. Additionally, as the transition matrix becomes sparser, it's important to exploit that sparsity when saving the model, which I don't believe msmhelper does, while we do. We also plan to add support for memory-mapped arrays for matrices that don't fit into memory (see issue #11).
>
> Moreover, I've developed chainopy with specific use cases in mind, such as text modeling and stock prediction. While MCMC applications are undoubtedly significant, there are already highly optimized packages like PyMC, which are built on tensor libraries and graph compilers like PyTensor and Aesara. Therefore, a more appropriate comparison would be against PyMC.
>
> Thank you for your attention to these points. I hope I've addressed your queries with utmost respect to other libraries and their authors.

Thank you for this detailed answer. It is true that I am not aware of any package that saves the matrix in a sparse format. However, since estimating the transition matrix is fast (~µs to ~ms), I guess it is typically estimated on the fly. And true, PyMC surely outperforms all other mentioned libraries, at least on large models, thanks to its natively CUDA-optimized JAX backend.
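For context, the on-the-fly estimate is cheap because counting transitions in a discrete trajectory is a single vectorized pass. A minimal sketch, where `estimate_tmat` is a hypothetical helper and not taken from any of the libraries discussed:

```python
import numpy as np

def estimate_tmat(traj, n_states):
    """Estimate a row-stochastic transition matrix from a state trajectory."""
    counts = np.zeros((n_states, n_states))
    # Count i -> j transitions for every consecutive pair in the trajectory.
    np.add.at(counts, (traj[:-1], traj[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # leave unvisited states as zero rows
    return counts / row_sums             # row-normalize to probabilities

traj = np.array([0, 1, 1, 0, 1, 1, 1, 0])
tmat = estimate_tmat(traj, n_states=2)
```

Since this is one pass over the trajectory with no Python-level loop, µs-to-ms estimation times for typical trajectory lengths are plausible.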

braniii commented 1 month ago

> @braniii Also, I added a few lines in the Statement of Need. I hope everything looks good to you now :))

These are really helping to underline the statement of chainopy :+1:

ps: Please add the missing full stop to the last sentence.