TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/

Robust Time Series Chains #730

Open seanlaw opened 1 year ago

seanlaw commented 1 year ago

I just came across this new paper by Jessica Lin on a more robust time series chain.

Not sure if it is relevant or how it compares with #211, especially if we are able to leverage our top-k code.

seanlaw commented 1 year ago

The paper also states the following:

The most closely related works are the ones by Zhu et al. [5] and Imamura et al. [6]. Zhu et al. [5] first introduce the concept of time series chains. The work enforces every adjacent pair of subsequences in the chain to be the left and right nearest neighbors of each other, and reports the longest chain as the top chain in the time series. Although the concept is simple and intuitive, the bi-directional nearest neighbor requirement is shown to be too strict in many real-world applications, as the chain can easily break by data fluctuations and noise [6]. Imamura et al. [6] relax the bi-direction condition by only enforcing the single-directional left nearest neighbor requirement, and added a pre-defined angle constraint to guarantee the directionality of the discovered chains. The concept is shown to be more robust than [5] in some conditions, but a new angle parameter is introduced, and we have observed that in some real-world scenarios, no meaningful chains can be found no matter how we set this parameter.

This is mostly consistent with what @NimaSarajpoor observed in his earlier explorations
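For reference, the bi-directional definition from Zhu et al. that the quoted passage describes can be sketched in plain NumPy. This is an illustrative toy only (brute-force distances, no exclusion zone for trivial matches, and not STUMPY's actual API): a link j → i only counts when i is j's right nearest neighbor *and* j is i's left nearest neighbor, which is exactly the condition that noise can break.

```python
import numpy as np

def znorm(x):
    # z-normalize a subsequence (guard against constant subsequences)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def left_right_nn(T, m):
    # Brute-force left/right nearest-neighbor indices for all
    # length-m subsequences of T (no exclusion zone, for simplicity)
    n = len(T) - m + 1
    S = np.array([znorm(T[i:i + m]) for i in range(n)])
    IL = np.full(n, -1)
    IR = np.full(n, -1)
    for i in range(n):
        d = np.linalg.norm(S - S[i], axis=1)
        if i > 0:
            IL[i] = np.argmin(d[:i])
        if i < n - 1:
            IR[i] = i + 1 + np.argmin(d[i + 1:])
    return IL, IR

def grow_chain(IL, IR, start):
    # Follow bidirectional links: extend j -> i only when
    # IR[j] == i AND IL[i] == j (the Zhu et al. condition)
    chain = [start]
    j = start
    while IR[j] != -1 and IL[IR[j]] == j:
        j = IR[j]
        chain.append(j)
    return chain
```

STUMPY's own `stumpy.atsc`/`stumpy.allc` implement this idea properly on top of the left/right matrix profile indices returned by `stumpy.stump`; the sketch above just makes the bidirectional requirement explicit.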

andersooi commented 1 month ago

Hi! I see that this issue is tagged as a 'good first issue' and I would like to see how I can help as a first-time contributor to this project.

NimaSarajpoor commented 1 month ago

@andersooi Thanks for keeping your interest in contributing to this library! Since I contributed before to something similar (see #211), I thought it might be a good idea to provide my input, which is mostly based on the information provided by @seanlaw in this comment.

I would be happy to help you along the way. Feel free to ask any questions you may have.


@seanlaw Please provide any additional comments you may have, as you certainly have a better overall vision.

andersooi commented 1 month ago

Thanks @NimaSarajpoor for your insights! I will start by familiarising myself with the contribution guide and looking through the stumpy basics tutorial notebook first.

andersooi commented 1 month ago

Hi @seanlaw @NimaSarajpoor! I'm working on a Mac and have been following the contributor guide to fork and clone the repo, set up a virtual environment (.venv in root directory), and install the dependencies using the requirements.txt file.

So now I'm trying to run the unit tests. When running ./setup.sh, it uninstalls and reinstalls stumpy, and based on what I have read, this seems to be correct.

However, when running ./test.sh, there's a whole bunch of code that appears and at the end of each block there would be a line that says e.g. would reformat /Users/Anders_1/Desktop/Projects/stumpy/.venv/lib/python3.11/site-packages/scipy/stats/tests/test_stats.py and at the end, the following message is produced:

Oh no! 💥 💔 💥
2387 files would be reformatted, 779 files would be left unchanged, 1 file would fail to reformat.
Error: Test execution encountered exit code 123

There doesn't seem to be any indication of test cases passing or failing, so I'm not exactly sure what is wrong here.

Any help is appreciated!

seanlaw commented 1 month ago

@andersooi This looks very familiar and I believe @joehiggi1758 had encountered this recently in issue https://github.com/TDAmeritrade/stumpy/issues/957#issuecomment-2128506321. Would it be possible to switch to using conda instead?

Oh no! 💥 💔 💥 2387 files would be reformatted, 779 files would be left unchanged, 1 file would fail to reformat. Error: Test execution encountered exit code 123

This is coming from black and, currently, it looks like black is (recursively) looking at files in your root/home directory rather than only in the cloned stumpy directory. Unfortunately, I don't use .venv and so it's not entirely clear to me why this is happening. I'm thinking that you may need to change directories into the cloned stumpy directory. Perhaps @JosephTLucas would be willing to provide some pointers here? Otherwise, we may need to remove the venv line from the Contributors Guide.

@andersooi What happens when you do

$ ./test.sh show

andersooi commented 1 month ago

Hi @seanlaw sorry for the late reply!

I took a look at the comment in the issue that you linked and it seems to be the same thing that I am facing. I already changed the directory to the cloned stumpy directory, but the issue persists. I will try out using conda instead.

When I ran ./test.sh show, this was the output:

(.venv) Anders@192 stumpy %  ./test.sh show 
Show development/test environment
Current working directory:  /Users/Anders_1/Desktop/Projects/stumpy
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'black'
Black version: 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'flake8'
Flake8 versoin: 
Python version:  3.11.4
NumPy version:  2.0.0
SciPy version:  1.14.0
Numba version:  0.60.0

[UPDATE]: Installed conda and created my virtual environment, ran ./conda.sh and ./setup.sh, both seemed fine. When I ran ./test.sh, got some ModuleNotFoundErrors for pytest, dask, distributed so went ahead and did conda install pytest dask distributed. Ran the test file again and got the following output:

(stumpy-env) Anders@192 stumpy % ./test.sh
Cleaning Up
Checking Black Code Formatting
All done! ✨ 🍰 ✨
90 files would be left unchanged.
Checking iSort Import Formatting
Skipped 1 files
Checking Flake8 Style Guide Enforcement
Checking Missing Docstrings
Ray Not Installed
Executing Unit Tests And Code Coverage
Testing Numba JIT Compiled Functions
=================================================================== test session starts ====================================================================
platform darwin -- Python 3.12.4, pytest-7.4.4, pluggy-1.5.0
rootdir: /Users/Anders_1/Desktop/Projects/stumpy
configfile: pytest.ini
collected 0 items / 1 error                                                                                                                                

========================================================================== ERRORS ==========================================================================
___________________________________________________________ ERROR collecting tests/test_aamp.py ____________________________________________________________
tests/test_aamp.py:1: in <module>
    import naive
tests/naive.py:5: in <module>
    from scipy.stats import norm
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/stats/__init__.py:610: in <module>
    from ._stats_py import *
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/stats/_stats_py.py:49: in <module>
    from . import distributions
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/stats/distributions.py:10: in <module>
    from . import _continuous_distns
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py:12: in <module>
    from scipy.interpolate import BSpline
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/interpolate/__init__.py:167: in <module>
    from ._interpolate import *
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/interpolate/_interpolate.py:12: in <module>
    from . import _fitpack_py
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/interpolate/_fitpack_py.py:8: in <module>
    from ._fitpack_impl import bisplrep, bisplev, dblint  # noqa: F401
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/scipy/interpolate/_fitpack_impl.py:103: in <module>
    'iwrk': array([], dfitpack_int), 'u': array([], float),
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/numpy/core/_dtype.py:46: in __repr__
    arg_str = _construction_repr(dtype, include_align=False)
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/numpy/core/_dtype.py:100: in _construction_repr
    return _scalar_str(dtype, short=short)
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/numpy/core/_dtype.py:143: in _scalar_str
    elif np.issubdtype(dtype, np.number):
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/numpy/core/numerictypes.py:417: in issubdtype
    arg1 = dtype(arg1).type
../../../opt/anaconda3/envs/stumpy-env/lib/python3.12/site-packages/numpy/core/_dtype.py:46: in __repr__
    arg_str = _construction_repr(dtype, include_align=False)
E   RecursionError: maximum recursion depth exceeded
!!! Recursion detected (same locals & position)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
==================================================================== 1 error in 50.58s =====================================================================
Error: Test execution encountered exit code 2

JosephTLucas commented 1 month ago

Reading through this and #957 , I believe this is the issue:

  1. Python venv creates a directory for all of the python dependencies. (FWIW, conda does something similar, but out in the user's home directory)
  2. test.sh doesn't exclude that directory when using black so black is executing against all of those virtual environment directories.

Since .venv is in the .gitignore, I think we just need to add --extend-exclude to the black command in test.sh. Reference
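If that's the fix, it would presumably amount to a one-line change to the black invocation in test.sh, along these lines (the exact flags already used in test.sh are an assumption here):

```shell
# Hypothetical tweak: --extend-exclude takes a regex, so this skips any
# local .venv directory while black recursively discovers files to check
black --check --extend-exclude="/\.venv/" .
```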

seanlaw commented 1 month ago

When I ran ./test.sh show, this was the output:

This output shows that your environment doesn't have black installed, and yet you were still able to run ./test.sh.

[UPDATE]: Installed conda and created my virtual environment, ran ./conda.sh and ./setup.sh, both seemed fine. When I ran ./test.sh, got some ModuleNotFoundErrors for pytest, dask, distributed so went ahead and did conda install pytest dask distributed.

That's REALLY strange. ./conda.sh basically installs the packages listed in environment.yml, which contains dask, distributed, and pytest (via pytest-cov). Did you happen to notice any errors/warnings when you ran ./conda.sh?

Ran the test file again and got the following output:

@andersooi Can you please try running test.sh again and see if the tests still fail with the same error?

seanlaw commented 1 month ago

Since .venv is in the .gitignore, I think we just need to add --extend-exclude to the black command in test.sh.

Thanks @JosephTLucas! I will investigate further

andersooi commented 1 month ago

That's REALLY strange. ./conda.sh basically installs the packages listed in environment.yml, which contains dask, distributed, and pytest (via pytest-cov). Did you happen to notice any errors/warnings when you ran ./conda.sh?

So I deleted everything and tried again from the start. Running ./conda.sh the first time gave two PackageNotInstalledErrors:

PackageNotInstalledError: Package is not installed in prefix.
  prefix: /Users/Anders_1/opt/anaconda3/envs/stumpy-env
  package name: conda

and

PackageNotInstalledError: Package is not installed in prefix.                                                                                               
  prefix: /Users/Anders_1/opt/anaconda3/envs/stumpy-env
  package name: numpy

so I went ahead and ran ./conda.sh again, and there weren't any warnings or errors.

@andersooi Can you please try running test.sh again and see if the tests still fail with the same error?

This time, running ./test.sh took a while (almost an hour). I'm not sure if it's supposed to take that long, or if it's because I'm not working with a GPU. I also noticed a lot of repeated tests (I'm guessing they were skipped the first time around and the script is trying to run them again?).

Anyways, the tests eventually stopped and at the end, this was the message:

=================================================================== 21 passed in 18.39s ====================================================================
Ray Not Installed
Name    Stmts   Miss  Cover   Missing
-------------------------------------
TOTAL   14448      0   100%

86 files skipped due to complete coverage.
Cleaning Up

Here is a link to a pastecode link showing the entire test process: https://pastecode.io/s/fi7fguid

I'm guessing that I have to install Ray and try running ./test.sh again? Though waiting for 1h for the entire tests to complete is rather long 😅

seanlaw commented 1 month ago

PackageNotInstalledError: Package is not installed in prefix.

This error is rather strange, but I'm glad that re-running ./conda.sh fixed it (though it should have installed correctly the first time if your environment was set up correctly).

so I went ahead and ran ./conda.sh again, and there weren't any warnings or errors.

Excellent!

Anyways, the tests eventually stopped and at the end, this was the message:

This is a great sign. Note that we have two rounds of testing. The first one executes all unit tests using numba JIT-compiled functions and so this is expected to be fast. However, the second round of running the unit tests is for coverage testing and this will turn off all JIT compiling, which forces everything to be much slower.

You can run the fast unit tests by itself with ./test.sh unit or the coverage tests by itself via ./test.sh coverage. However, in this case, you'll be focused on creating a reproducible notebook (i.e., you aren't adding code to the code base... yet) so testing can be ignored for now.
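To make the two rounds concrete, my understanding of the split is as follows (assuming test.sh disables JIT for the coverage round via numba's standard NUMBA_DISABLE_JIT environment variable, which I haven't verified against the script):

```shell
# Round 1: unit tests with numba JIT compilation enabled (fast)
./test.sh unit

# Round 2: coverage run; JIT is turned off (e.g., NUMBA_DISABLE_JIT=1),
# so everything executes as pure Python and is much slower
./test.sh coverage
```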

I'm guessing that I have to install Ray and try running ./test.sh again?

You can ignore ray completely

Though waiting for 1h for the entire tests to complete is rather long 😅

I agree. I noticed this yesterday as well and I believe we know where the recent increase in time is coming from (see this issue). Of course, this also depends on the number of threads and the speed of those threads. We have over 1500 tests that need to be run but, to reiterate, this is not so important right now. The goal is to reproduce the robust time series chains and then, if necessary, treat the notebook as a throwaway prototype to build from. Does that make sense @andersooi?

andersooi commented 1 month ago

This is a great sign. Note that we have two rounds of testing. The first one executes all unit tests using numba JIT-compiled functions and so this is expected to be fast. However, the second round of running the unit tests is for coverage testing and this will turn off all JIT compiling, which forces everything to be much slower.

You can run the fast unit tests by itself with ./test.sh unit or the coverage tests by itself via ./test.sh coverage. However, in this case, you'll be focused on creating a reproducible notebook (i.e., you aren't adding code to the code base... yet) so testing can be ignored for now.

Ahh, now I understand why there were repeated tests. This makes sense.

I agree. I noticed this yesterday as well and I believe we know where the recent increase in time is coming from (see this issue). Of course, this is also dependent on the number of threads and the speed of those threads. We have over 1500 tests that need to be run but, to reiterate, this is not so important for trying to reproduce the robust time series chains and then possible consider it as a throwaway prototype from which to build off of if necessary. Does that make sense @andersooi?

Yup, thank you for the detailed explanations!

I have gone through the Tutorial_STUMPY_Basics notebook to familiarise myself with the basics, will proceed with creating the notebook.

Will reach out if I have any questions or run into any difficulties!

seanlaw commented 1 month ago

I have gone through the Tutorial_STUMPY_Basics notebook to familiarise myself with the basics, will proceed with creating the notebook.

@andersooi Instead of the Tutorial_STUMPY_Basics notebook, I might recommend this "work-in-progress" reproducer and its corresponding notebook. The Tutorial_STUMPY_Basics notebook is already "polished"/fully vetted, and the underlying code has already been converted into stumpy functions and pushed into the code base. However, the "dirtier" damp code example is still being iterated/debated upon and might provide a better perspective on the expectations (i.e., that we want to see all of the code and whether or not it actually "works" as well as the authors claim). If it turns out to be too brittle (or the results from the paper aren't reproducible), and it can't actually be general enough to be applied broadly to other use cases, then we should avoid putting it into the hands of our users and abandon ship! This is essentially our due diligence phase.

andersooi commented 1 month ago

@seanlaw sure thing, I will give it a look!