TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.67k stars 320 forks source link

Lacking coverage for all edge cases identified when updating min.py 'find_pkg_mismatches' #996

Closed joehiggi1758 closed 4 months ago

joehiggi1758 commented 5 months ago

We need a way to quickly check whether or not our regex has stopped covering one of our identified/existing cases in min.py find_pkg_mismatches(). This may not need to be a unit test, but we should implement some protection.

The list of all edge case formats that we've identified so far...

seanlaw commented 5 months ago

I'm thinking that we can split out the generic regex part into its own function ():

def match_str_version(line, pkg_name):
    matches = re.search(
                    rf"""
                        {pkg_name}  # Package name
                        [=><:"\'\[\]]+  # Zero or more special characters
                        (\d+\.\d+[\.0-9]*)  # Capture "version" in `matches`
                        """,
                    line,
                    re.VERBOSE,  # Ignores all whitespace in pattern
                )
    return matches

add then we update our existing function:

def find_pkg_mismatches(pkg_name, pkg_version, fnames):
    """
    Determine if any package version has mismatches
    """
    pkg_mismatches = []

    for fname in fnames:
        with open(fname, "r") as file:
            for line_num, line in enumerate(file, start=1):
                l = line.strip().replace(" ", "").lower()
                matches = match_str_version(l, pkg_name)
                if matches is not None:
                    version = matches.groups()[0]
                    if version != pkg_version:
                        pkg_mismatches.append((pkg_name, version, fname, line_num))

    return pkg_mismatches

Finally, we can add a new test function that looks like:

def test_pkg_mismatch_regex():
    pkgs = {
        "numpy": "0.0",
        "scipy": "0.0",
        "numba": "0.0",
        "python": "2.7",
        "python-version": "2.7",
    }
    lines = [
        'Programming Language :: Python :: 3.8',
        'STUMPY supports Python 3.8',
        "python-version: ['3.8']",
        'requires-python = ">=3.8"',
        'numba>=0.55.2',
    ]

    for pkg_name, pkg_version in pkgs.items():
        for line in lines:
            matches = match_str_version(line, pkg_name)
            if matches is None:
                # Raise an error 

And so this test would/should be executed BEFORE calling find_pkg_mismatches.

joehiggi1758 commented 5 months ago

Sounds great! Please excuse my naïveté, but looking to learn! Why use a validation function as opposed to a unit test here? (Aka where does that line come in)

seanlaw commented 5 months ago

@joehiggi1758 There no hard rule (i.e., it's more by feeling) but my rationale for this case is that this min.py script is a utility file that really has nothing to do with matrix profiles (i.e., if the "mismatch" test is in the tests/ directory and it fails then it will make it seem like STUMPY is failing, which isn't ever the message that I want to send). Thus, I am trying to limit our unit test suite to be for matrix profiles only. In reality, I should/could split out all of these utility scripts into a separate package but, alas, I'm too lazy and here we are. I hope that makes sense? At the same time, this is a perfectly good opportunity to argue/debate the merits of each option (there may be additional options that I have no considered too).

In some situations, I can/will be stubborn but, in this case, I'm not sure what the best thing to do is. To some extent, I'm just taking the easy way out :)