Data-Simply / pyretailscience

pyretailscience - A data analysis and science toolkit for retail data

feat: added production association rule module #69

Closed mvanwyk closed 4 months ago

mvanwyk commented 4 months ago

PR Type

Enhancement, Documentation, Tests


Description


Changes walkthrough 📝

Relevant files

Enhancement

product_association.py: Implement product association rules generation module.

pyretailscience/product_association.py (+304/-0)
  • Added ProductAssociation class for generating product association rules.
  • Implemented methods to calculate support, confidence, and uplift metrics.
  • Included validation for input parameters and data.

Tests

test_product_association.py: Add tests for product association rules module.

tests/test_product_association.py (+330/-0)
  • Added tests for ProductAssociation class.
  • Included fixtures for sample data and expected results.
  • Tested various configurations and edge cases.

Documentation

analysis_modules.md: Document product association rules module.

docs/analysis_modules.md (+57/-0)
  • Documented the product association rules module.
  • Provided examples and use cases.
  • Explained metrics like support, confidence, and uplift.

product_association.md: Add API reference for product association module.

docs/api/product_association.md (+3/-0)
  • Added API reference for `ProductAssociation` class.

product_association.ipynb: Add example notebook for product association rules.

docs/examples/product_association.ipynb (+679/-0)
  • Created example notebook for product association rules.
  • Demonstrated usage with sample data.
  • Showcased filtering and analysis capabilities.

mkdocs.yml: Update documentation navigation for product association module.

mkdocs.yml (+2/-0)
  • Updated navigation to include product association documentation and examples.


    coderabbitai[bot] commented 4 months ago

    Walkthrough

    The recent updates introduce a comprehensive framework for product association rules within the retail analytics domain. New documentation and examples enhance understanding of how these rules can optimize sales strategies and customer insights. The ProductAssociation class has been implemented to calculate key metrics like support, confidence, and uplift, supported by tests to ensure reliability. This holistic approach aims to empower retailers with data-driven insights to improve decision-making.
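To make the three metrics concrete, here is a minimal sketch that computes them with plain pandas, using the textbook definitions: support is the share of baskets containing both items, confidence is the co-occurrence count divided by the first item's occurrence count, and uplift is support divided by the product of the two items' individual probabilities. The helper `pair_metrics` is hypothetical and is not the `ProductAssociation` implementation, which may define or filter the metrics differently. The sample baskets are the ones from the test fixture quoted later in this thread.

```python
from itertools import permutations

import pandas as pd

# Sample baskets copied from the test fixture quoted in this thread.
transactions = pd.DataFrame(
    {
        "transaction_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5],
        "product": [
            "milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
            "milk", "bread", "butter", "eggs", "fruit", "bread",
        ],
    }
)


def pair_metrics(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Hypothetical helper: textbook support, confidence and uplift for item pairs."""
    baskets = df.groupby(group_col)[value_col].apply(set)
    n_baskets = len(baskets)

    occurrences = {}  # item -> number of baskets containing it
    pair_counts = {}  # (item_1, item_2) -> number of baskets containing both
    for items in baskets:
        for item in items:
            occurrences[item] = occurrences.get(item, 0) + 1
        for pair in permutations(sorted(items), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    rows = []
    for (item_1, item_2), cooccurrences in sorted(pair_counts.items()):
        support = cooccurrences / n_baskets
        prob_1 = occurrences[item_1] / n_baskets
        prob_2 = occurrences[item_2] / n_baskets
        rows.append(
            {
                "product_1": item_1,
                "product_2": item_2,
                "occurrences_1": occurrences[item_1],
                "occurrences_2": occurrences[item_2],
                "cooccurrences": cooccurrences,
                "support": support,
                "confidence": cooccurrences / occurrences[item_1],
                "uplift": support / (prob_1 * prob_2),
            }
        )
    return pd.DataFrame(rows)


rules = pair_metrics(transactions, group_col="transaction_id", value_col="product")
print(rules.head())
```

For example, beer and diapers each appear in one of the five baskets and always together, so the pair has support 0.2, confidence 1.0, and uplift of about 5.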

    Changes

| Files | Change Summary |
|---|---|
| docs/analysis_modules.md | Added section on "Product Association Rules" detailing functionalities and metrics in retail analytics. |
| docs/api/product_association.md | New documentation for the product_association module, explaining its purpose and usage. |
| docs/examples/product_association.ipynb | Introduced a Jupyter notebook demonstrating the practical application of product association rules. |
| mkdocs.yml | Updated navigation to include new entries for "Product Association" in Examples and Reference sections. |
| pyretailscience/product_association.py | Implemented the ProductAssociation class to handle product associations and associated metrics. |
| tests/test_product_association.py | Created unit tests for the ProductAssociation module to ensure functionality and handle edge cases. |

    Sequence Diagram(s)

    sequenceDiagram
        participant Retailer
        participant ProductAssociation
        participant DataFrame
        participant Metrics
    
        Retailer->>DataFrame: Load transaction data
        Retailer->>ProductAssociation: Initialize with DataFrame
        ProductAssociation->>Metrics: Calculate support, confidence, uplift
        Metrics-->>ProductAssociation: Return calculated metrics
        ProductAssociation-->>Retailer: Provide insights on product associations

    πŸ‡ In the land of retail, where sales are the quest,
    A new tool has arrived, it’s simply the best!
    With rules of association, insights take flight,
    Cross-selling and more, making shopping a delight!
    So hop to your data, let metrics unfold,
    With every new purchase, let stories be told! 🌟


    codiumai-pr-agent-pro[bot] commented 4 months ago

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Possible Bug
    The method `_calc_association` uses a complex series of operations and checks that could be simplified or broken down into smaller, more manageable functions. This would improve readability and maintainability.

    Performance Concern
    The method `_calc_association` could potentially handle large datasets inefficiently due to the use of dense operations like `toarray()` on sparse matrices. Consider optimizing these operations or exploring more efficient data structures.
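One way to act on the performance concern is to keep the pairwise counts sparse end to end. The sketch below is an illustration only: the incidence matrix is made-up data, and it does not reproduce how `_calc_association` is actually written. It relies on the fact that the product of two SciPy sparse matrices is itself sparse, so co-occurrence counts can be read from the stored non-zero entries without calling `toarray()`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical basket-by-item incidence matrix (rows: baskets, columns: items).
# Basket 0 holds items 0, 1, 2; basket 1 holds items 1, 2; basket 2 holds item 0.
rows = np.array([0, 0, 0, 1, 1, 2])
cols = np.array([0, 1, 2, 1, 2, 0])
incidence = csr_matrix(
    (np.ones(len(rows), dtype=np.int64), (rows, cols)),
    shape=(3, 3),
)

# Per-item occurrences: column sums give a small dense vector, one entry per item.
occurrences = np.asarray(incidence.sum(axis=0)).ravel()

# Pairwise co-occurrence counts: a sparse-by-sparse product stays sparse.
cooccurrence_matrix = (incidence.T @ incidence).tocoo()

# Visit only the stored entries and skip the diagonal (an item paired with itself).
pair_counts = {
    (int(i), int(j)): int(count)
    for i, j, count in zip(
        cooccurrence_matrix.row, cooccurrence_matrix.col, cooccurrence_matrix.data
    )
    if i != j
}
print(occurrences, pair_counts)
```

Memory then scales with the number of item pairs that actually co-occur rather than with the square of the number of items.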
    codiumai-pr-agent-pro[bot] commented 4 months ago

    PR Code Suggestions ✨

    Category | Suggestion | Score
Typo

✅ Correct a typo in the documentation text

**Consider adding a space between 'of' and 'effective' to correct the typo in the text.**

[docs/analysis_modules.md [116]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R116-R116)

```diff
-Marketing and promotions: Association rules can guide the creation ofeffective bundle offers and promotional campaigns.
+Marketing and promotions: Association rules can guide the creation of effective bundle offers and promotional campaigns.
```

`[Suggestion has been applied]`

Suggestion importance[1-10]: 10
Why: The suggestion corrects a clear typo, improving the readability and professionalism of the documentation.
Score: 10
Enhancement

Add explanations for the columns in the example table to aid reader comprehension

**Add a brief explanation of the example table columns to enhance understanding for readers unfamiliar with the terms used.**

[docs/analysis_modules.md [144]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R144-R144)

```diff
 | product_name_1 | product_name_2 | occurrences_1 | occurrences_2 | cooccurrences | support | confidence | uplift |
+
```

Suggestion importance[1-10]: 9
Why: Adding explanations for the table columns significantly improves the comprehensibility of the example for readers unfamiliar with the terms, enhancing the documentation's utility.
Score: 9
Improve flexibility by using a variable for the file path

**Replace the hardcoded file path with a variable that can be set at the top of the notebook. This makes the notebook more flexible and easier to use in different environments without modifying the code cells that load data.**

[docs/examples/product_association.ipynb [219]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR219-R219)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+data_file_path = "../../data/transactions.parquet"  # Set the path to the data file at the top of the notebook
+df = pd.read_parquet(data_file_path)
```

Suggestion importance[1-10]: 7
Why: Using a variable for the file path increases the flexibility and reusability of the notebook, making it easier to adapt to different environments. However, it is a minor enhancement and not crucial for functionality.
Score: 7
Use loops to generate DataFrame to reduce code repetition and enhance clarity

**Use a loop to generate the DataFrame to avoid repetition and improve code clarity.**

[tests/test_product_association.py [27-37]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR27-R37)

```diff
-return pd.DataFrame({
-    "product_1": [
-        "beer", "bread", "bread", "bread", "bread", "butter", "butter", "butter", "butter", "diapers",
-        "eggs", "eggs", "eggs", "eggs", "fruit", "fruit", "fruit", "fruit", "milk", "milk", "milk",
-        "milk",
-    ],
-    "product_2": [
-        "diapers", "butter", "eggs", "fruit", "milk", "bread", "eggs", "fruit", "milk", "beer", "bread",
-        "butter", "fruit", "milk", "bread", "butter", "eggs", "milk", "bread", "butter", "eggs",
-        "fruit",
-    ],
-    ...
-})
+products = ["beer", "bread", "butter", "diapers", "eggs", "fruit", "milk"]
+data = {"product_1": [], "product_2": []}
+for p1 in products:
+    for p2 in products:
+        if p1 != p2:
+            data["product_1"].append(p1)
+            data["product_2"].append(p2)
+return pd.DataFrame(data)
```

Suggestion importance[1-10]: 3
Why: The suggestion to use loops for generating the DataFrame reduces repetition but oversimplifies the data structure, potentially losing the specific test cases intended by the hardcoded values. The original code provides explicit test data which is crucial for testing specific scenarios.
Score: 3
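The low score is easy to check by counting. The sketch below rebuilds the nested loop from the suggestion with `itertools.permutations` and compares its size against the hardcoded fixture, which lists 22 pairs. The names `all_pairs` and `fixture_pair_count` are introduced here for illustration only.

```python
from itertools import permutations

import pandas as pd

products = ["beer", "bread", "butter", "diapers", "eggs", "fruit", "milk"]

# Equivalent to the suggested nested loop: every ordered pair of distinct products.
all_pairs = pd.DataFrame(
    list(permutations(products, 2)),
    columns=["product_1", "product_2"],
)

# The hardcoded fixture in the diff lists 22 pairs (counted from the quoted lists).
fixture_pair_count = 22

print(len(all_pairs), fixture_pair_count)  # 42 22

# The loop also produces pairs the fixture never contains, such as beer with bread.
has_beer_bread = (
    (all_pairs["product_1"] == "beer") & (all_pairs["product_2"] == "bread")
).any()
print(has_beer_bread)  # True
```

The loop yields 7 × 6 = 42 rows, so it would assert associations between products that never share a basket in the sample data.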
Possible bug

Add a check to ensure the group and value columns are not the same to avoid logical errors in processing

**Consider adding a check to ensure that the value_col and group_col are not the same. This is important because if both columns are the same, it would lead to incorrect calculations of associations, as the same column would be used to identify both the product and the transaction/customer, which is logically incorrect and could lead to misleading results.**

[pyretailscience/product_association.py [132]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R132-R132)

```diff
 required_cols = [group_col, value_col]
+if group_col == value_col:
+    raise ValueError("The group column and value column must be different.")
```

Suggestion importance[1-10]: 9
Why: This suggestion addresses a potential logical error that could lead to incorrect calculations of associations, which is crucial for the accuracy of the analysis.
Score: 9
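As a self-contained illustration of this check, the sketch below wraps it in a standalone function together with a missing-column check. The function name `validate_columns` and the second error message are assumptions made for this example; the module's real validation code is not shown in this thread and may differ.

```python
import pandas as pd


def validate_columns(df: pd.DataFrame, value_col: str, group_col: str) -> None:
    """Hypothetical validator; not the module's actual code."""
    # Same column for products and baskets makes every "basket" a single item.
    if group_col == value_col:
        raise ValueError("The group column and value column must be different.")

    # Assumed companion check: both columns must exist in the DataFrame.
    required_cols = [group_col, value_col]
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")


sample_df = pd.DataFrame({"transaction_id": [1, 1, 2], "product_name": ["milk", "bread", "milk"]})

# Passes silently for distinct, existing columns.
validate_columns(sample_df, value_col="product_name", group_col="transaction_id")

# Raises when the same column is used twice.
try:
    validate_columns(sample_df, value_col="product_name", group_col="product_name")
except ValueError as error:
    same_column_message = str(error)
    print(same_column_message)
```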
Robustness

Add error handling to the data loading process

**Add error handling for the data loading process to manage cases where the file might not exist or is corrupted, enhancing the robustness of the notebook.**

[docs/examples/product_association.ipynb [219]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR219-R219)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+try:
+    df = pd.read_parquet("../../data/transactions.parquet")
+except Exception as e:
+    print(f"An error occurred while loading the data: {e}")
```

Suggestion importance[1-10]: 9
Why: Adding error handling significantly improves the robustness of the notebook by managing cases where the file might not exist or is corrupted. This is a crucial enhancement for reliability.
Score: 9
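One caveat with the diff as written: if loading fails, the `except` branch only prints a message, so `df` is never defined and the next notebook cell fails with a `NameError`. A sketch of an alternative that fails fast with a clearer message is shown below. The helper `load_transactions` is hypothetical, and reading the file still requires a parquet engine such as pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd


def load_transactions(path: str) -> pd.DataFrame:
    """Hypothetical loader: raise a descriptive error instead of printing and continuing."""
    file_path = Path(path)
    if not file_path.is_file():
        # Resolve to an absolute path so the message shows where the notebook looked.
        raise FileNotFoundError(f"Transaction data not found at {file_path.resolve()}")
    return pd.read_parquet(file_path)


# Demonstration with a path that is assumed not to exist.
try:
    df = load_transactions("no_such_directory/transactions.parquet")
except FileNotFoundError as error:
    load_error = str(error)
    print(load_error)
```

Re-raising keeps the failure at the cell that caused it, which is usually easier to debug in a notebook than a later, unrelated-looking error.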
Maintainability

Encapsulate product association logic into a function for better reusability and testability

**Consider using a function to encapsulate the logic for generating product association rules, which can then be reused and tested more easily.**

[docs/examples/product_association.ipynb [374-381]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR374-R381)

```diff
-from pyretailscience.product_association import ProductAssociation
+def generate_product_association(df):
+    from pyretailscience.product_association import ProductAssociation
+    pa = ProductAssociation(
+        df,
+        value_col="product_name",
+        group_col="transaction_id",
+    )
+    return pa.df.head()

-pa = ProductAssociation(
-    df,
-    value_col="product_name",
-    group_col="transaction_id",
-)
-pa.df.head()
+# Example usage:
+association_df = generate_product_association(df)
+print(association_df)
```

Suggestion importance[1-10]: 8
Why: Encapsulating the logic into a function improves code maintainability and reusability, making it easier to test and extend. This is a valuable improvement for long-term code management.
Score: 8
Improve variable naming for better code readability

**Consider using a more descriptive variable name instead of 'df' to improve code readability and maintainability.**

[docs/analysis_modules.md [136-140]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R136-R140)

```diff
 pa = ProductAssociation(
-    df,
+    transaction_data,
     value_col="product_name",
     group_col="transaction_id",
 )
```

Suggestion importance[1-10]: 7
Why: Using a more descriptive variable name enhances code readability and maintainability, though it is a minor improvement.
Score: 7
Refactor the _calc_association method to improve readability and maintainability

**To enhance code readability and maintainability, consider refactoring the large _calc_association method by extracting parts of the logic into smaller, more focused methods. For example, the logic for calculating occurrences and probabilities could be moved into a separate method.**

[pyretailscience/product_association.py [156-213]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R156-R213)

```diff
+def _calc_occurrences_and_probabilities(sparse_matrix, row_count):
+    occurrences = np.array(sparse_matrix.sum(axis=0)).flatten()
+    occurence_prob = occurrences / row_count
+    return occurrences, occurence_prob
+
 def _calc_association(
     df: pd.DataFrame,
     value_col: str,
     group_col: str = "customer_id",
     target_item: str | None = None,
     number_of_combinations: Literal[2, 3] = 2,
     min_occurrences: int = 1,
     min_cooccurrences: int = 1,
     min_support: float = 0.0,
     min_confidence: float = 0.0,
     min_uplift: float = 0.0,
 ) -> pd.DataFrame:
+    # Existing code with calls to the new method where appropriate
```

Suggestion importance[1-10]: 6
Why: This suggestion enhances code readability and maintainability by breaking down a large method into smaller, more focused methods, which is beneficial for long-term maintenance.
Score: 6
Improve maintainability by using a fixture for sample data

**Replace the hardcoded DataFrame creation with a fixture function to improve maintainability and reusability.**

[tests/test_product_association.py [16-20]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR16-R20)

```diff
-return pd.DataFrame({
-    "transaction_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5],
-    "product": ["milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
-                "milk", "bread", "butter", "eggs", "fruit", "bread"],
-})
+return self.sample_transactions_df()
```

Suggestion importance[1-10]: 4
Why: While using a fixture function can improve maintainability, the suggestion does not provide the implementation of `self.sample_transactions_df()`, making it unclear how it would be integrated. Additionally, the current hardcoded DataFrame is simple and clear enough for the test context.
Score: 4
Enhance test isolation and reusability by using a helper function for DataFrame creation

**Refactor the DataFrame creation to use a helper function for generating test data, enhancing test isolation and reusability.**

[tests/test_product_association.py [64-71]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR64-R71)

```diff
-return pd.DataFrame({
-    "product_1": [
-        ("bread", "butter"), ("bread", "butter"), ("bread", "butter"), ("bread", "eggs"), ("bread", "eggs"),
-        ("bread", "eggs"), ("bread", "fruit"), ("bread", "fruit"), ("bread", "fruit"), ("bread", "milk"),
-        ("bread", "milk"), ("bread", "milk"), ("butter", "eggs"), ("butter", "eggs"), ("butter", "eggs"),
-        ("butter", "fruit"), ("butter", "fruit"), ("butter", "fruit"), ("butter", "milk"),
-        ("butter", "milk"), ("butter", "milk"), ("eggs", "fruit"), ("eggs", "fruit"), ("eggs", "fruit"),
-        ("eggs", "milk"), ("eggs", "milk"), ("eggs", "milk"), ("fruit", "milk"), ("fruit", "milk"),
-        ("fruit", "milk"),
-    ],
-    ...
-})
+return self.generate_pair_items_df()
```

Suggestion importance[1-10]: 4
Why: Similar to the first suggestion, using a helper function can improve maintainability, but the suggestion lacks the implementation details of `self.generate_pair_items_df()`. The current hardcoded DataFrame is clear and specific for the test cases.
Score: 4
Performance

Use coo_matrix for efficient sparse matrix creation and convert to csr_matrix if necessary

**To improve the efficiency of the sparse matrix creation, consider using the coo_matrix instead of csr_matrix for the initial creation, as coo_matrix is more efficient for constructing matrices incrementally. This can be converted to csr_matrix afterwards if needed for further operations that require fast row slicing.**

[pyretailscience/product_association.py [231-238]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R231-R238)

```diff
-sparse_matrix = csr_matrix(
+from scipy.sparse import coo_matrix
+sparse_matrix = coo_matrix(
     (
         [1] * len(unique_combo_df),
         (
             unique_combo_df[group_col].cat.codes,
             unique_combo_df[value_col].cat.codes,
         ),
     ),
-)
+).tocsr()
```

Suggestion importance[1-10]: 7
Why: This suggestion improves performance by using a more efficient matrix construction method, which is beneficial but not critical for correctness.
Score: 7
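For reference, here is a runnable sketch of the COO-then-CSR construction on made-up data. The DataFrame contents and the explicit `shape` argument are assumptions for the example; only the variable name `unique_combo_df` comes from the diff. Whether the swap helps in practice is worth measuring: to my understanding, `csr_matrix` built from `(data, (row, col))` triplets already goes through a COO conversion internally, so any gain may be small.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical de-duplicated (basket, item) combinations with categorical columns.
unique_combo_df = pd.DataFrame(
    {
        "transaction_id": pd.Categorical([1, 1, 2, 2, 3]),
        "product_name": pd.Categorical(["milk", "bread", "bread", "eggs", "milk"]),
    }
)

# Category codes serve as integer row and column indices.
row_codes = unique_combo_df["transaction_id"].cat.codes.to_numpy()
col_codes = unique_combo_df["product_name"].cat.codes.to_numpy()
n_groups = len(unique_combo_df["transaction_id"].cat.categories)
n_items = len(unique_combo_df["product_name"].cat.categories)

# Build in COO (triplet) format, then convert to CSR for row slicing and arithmetic.
sparse_matrix = coo_matrix(
    (np.ones(len(unique_combo_df), dtype=np.int64), (row_codes, col_codes)),
    shape=(n_groups, n_items),
).tocsr()

# Column sums give per-item basket counts, ordered by category: bread, eggs, milk.
item_counts = np.asarray(sparse_matrix.sum(axis=0)).ravel()
print(sparse_matrix.format, item_counts)
```

Passing `shape` explicitly keeps the matrix dimensions stable even when the last category has no rows, which inferring the shape from the largest index would not.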
Best practice

Improve variable naming for clarity and maintainability

**Use more descriptive variable names in the print statements to enhance code readability and maintainability.**

[docs/examples/product_association.ipynb [238-239]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR238-R239)

```diff
-print(f"Number of unique customers: {df['customer_id'].nunique()}")
-print(f"Number of unique transactions: {df['transaction_id'].nunique()}")
+num_unique_customers = df['customer_id'].nunique()
+num_unique_transactions = df['transaction_id'].nunique()
+print(f"Number of unique customers: {num_unique_customers}")
+print(f"Number of unique transactions: {num_unique_transactions}")
```

Suggestion importance[1-10]: 6
Why: Using more descriptive variable names enhances code readability and maintainability. This is a good practice but is a minor improvement in terms of overall impact.
Score: 6
Use list comprehensions for creating DataFrame columns to make the code more concise

**Use list comprehensions for more concise and Pythonic code when creating DataFrame columns.**

[tests/test_product_association.py [39-40]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR39-R40)

```diff
+num_items = 22 # Adjust as necessary
 return pd.DataFrame({
-    "occurrences_1": [1, 3, 3, 3, 3, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2],
-    "occurrences_2": [1, 2, 2, 3, 2, 3, 2, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3],
+    "occurrences_1": [random.randint(1, 3) for _ in range(num_items)],
+    "occurrences_2": [random.randint(1, 3) for _ in range(num_items)],
     ...
 })
```

Suggestion importance[1-10]: 2
Why: Using list comprehensions with random values does not preserve the specific test cases intended by the hardcoded values. The original explicit values are necessary for ensuring the tests cover the expected scenarios accurately.
Score: 2
    codecov[bot] commented 4 months ago

    Codecov Report

    Attention: Patch coverage is 88.70968% with 7 lines in your changes missing coverage. Please review.

| Files | Patch % | Lines |
|---|---|---|
| pyretailscience/product_association.py | 88.70% | 6 Missing and 1 partial :warning: |

| Flag | Coverage Δ |
|---|---|
| service | ? |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| pyretailscience/product_association.py | 88.70% <88.70%> (ø) |

    ... and 8 files with indirect coverage changes
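The reported percentage can be sanity-checked with a little arithmetic. The sketch below assumes patch coverage is computed as covered lines divided by all changed lines, with partial lines counted as not covered. That is my reading of the report, not something stated in it. Under that assumption, 6 missing plus 1 partial line and 88.70968% imply 62 changed lines, 55 of them covered.

```python
# Figures taken from the Codecov report.
missing_lines = 6
partial_lines = 1
reported_patch_coverage = 88.70968  # percent

# Assumption: coverage = covered / total, with partial lines counted as uncovered.
uncovered = missing_lines + partial_lines

# Search for the total line count that reproduces the reported percentage.
candidates = [
    total
    for total in range(uncovered + 1, 1000)
    if abs(100 * (total - uncovered) / total - reported_patch_coverage) < 5e-6
]
print(candidates)  # [62]

total_lines = candidates[0]
covered_lines = total_lines - uncovered
print(covered_lines, total_lines)  # 55 62
```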