feat: added product association rule module

mvanwyk commented 4 months ago

PR Type

Enhancement, Documentation, Tests

Description

Implemented the ProductAssociation class for generating product association rules.
Added methods to calculate support, confidence, and uplift metrics.
Included validation for input parameters and data.
Added comprehensive tests for the ProductAssociation class.
Documented the module with examples, use cases, and API reference.
Updated documentation navigation to include the new module.

Changes walkthrough 📝

Relevant files

Enhancement

product_association.py

Implement product association rules generation module.

pyretailscience/product_association.py

Added ProductAssociation class for generating product association
rules.

Implemented methods to calculate support, confidence, and uplift
metrics.

Included validation for input parameters and data.

+304/-0

Tests

test_product_association.py

Add tests for product association rules module.

tests/test_product_association.py

Added tests for ProductAssociation class.

Included fixtures for sample data and expected results.

Tested various configurations and edge cases.

+330/-0

Documentation

analysis_modules.md

Document product association rules module.

docs/analysis_modules.md

Documented the product association rules module.

Provided examples and use cases.

Explained metrics like support, confidence, and uplift.

+57/-0

product_association.md

Add API reference for product association module.

docs/api/product_association.md

Added API reference for `ProductAssociation` class.

+3/-0

product_association.ipynb

Add example notebook for product association rules.

docs/examples/product_association.ipynb

Created example notebook for product association rules.

Demonstrated usage with sample data.

Showcased filtering and analysis capabilities.

+679/-0

mkdocs.yml

Update documentation navigation for product association module.

mkdocs.yml

Updated navigation to include product association documentation and examples.

+2/-0

Summary by CodeRabbit

New Features
- Introduction of a "Product Association Rules" section in the documentation, detailing applications in retail analytics.
- New documentation file for the product_association module, enhancing user understanding of product associations.
- Addition of an example notebook demonstrating the practical implementation of product association rules.
Documentation
- Expanded navigation structure in documentation to include new sections for product association examples and API references.
Tests
- Added a comprehensive suite of unit tests for the ProductAssociation module to ensure functionality and reliability.

coderabbitai[bot] commented 4 months ago

Walkthrough

This PR adds product association rules to the retail analytics package. The `ProductAssociation` class calculates support, confidence, and uplift for product combinations and is covered by unit tests. New documentation and an example notebook show how the rules apply to retail use cases such as cross-selling.
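To make the three metrics concrete, here is a minimal sketch in plain Python using the textbook definitions of support, confidence, and uplift. This thread does not show the formulas `ProductAssociation` itself uses, so the definitions are an assumption, and `pair_metrics` is a hypothetical helper rather than part of the module. The sample rows reuse the values of the transactions fixture from the PR's tests.

```python
from collections import defaultdict

# (transaction_id, product) rows, same values as the PR's sample test fixture.
rows = list(zip(
    [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5],
    ["milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
     "milk", "bread", "butter", "eggs", "fruit", "bread"],
))


def pair_metrics(rows, item_a, item_b):
    """Hypothetical helper: textbook metrics for the rule item_a -> item_b."""
    # Collect the set of products bought in each transaction.
    baskets = defaultdict(set)
    for group, item in rows:
        baskets[group].add(item)

    n = len(baskets)
    occ_a = sum(item_a in basket for basket in baskets.values())
    occ_b = sum(item_b in basket for basket in baskets.values())
    cooc = sum(item_a in basket and item_b in basket for basket in baskets.values())

    support = cooc / n            # share of baskets containing both items
    confidence = cooc / occ_a     # share of item_a baskets that also hold item_b
    uplift = support / ((occ_a / n) * (occ_b / n))  # ratio to independence
    return {
        "occurrences_1": occ_a,
        "occurrences_2": occ_b,
        "cooccurrences": cooc,
        "support": support,
        "confidence": confidence,
        "uplift": uplift,
    }


metrics = pair_metrics(rows, "bread", "milk")
print(metrics)
```

Bread appears in 3 of the 5 transactions, milk in 2, and both together in 2, so under these definitions support is 2/5, confidence is 2/3, and uplift is about 1.67. An uplift above 1 means the pair is bought together more often than independence would predict. The sketch assumes both items occur at least once; otherwise the divisions fail.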

Changes

Files	Change Summary
`docs/analysis_modules.md`	Added section on "Product Association Rules" detailing functionalities and metrics in retail analytics.
`docs/api/product_association.md`	New documentation for the `product_association` module, explaining its purpose and usage.
`docs/examples/product_association.ipynb`	Introduced a Jupyter notebook demonstrating the practical application of product association rules.
`mkdocs.yml`	Updated navigation to include new entries for "Product Association" in Examples and Reference sections.
`pyretailscience/product_association.py`	Implemented the `ProductAssociation` class to handle product associations and associated metrics.
`tests/test_product_association.py`	Created unit tests for the `ProductAssociation` module to ensure functionality and handle edge cases.

Sequence Diagram(s)

sequenceDiagram
    participant Retailer
    participant ProductAssociation
    participant DataFrame
    participant Metrics

    Retailer->>DataFrame: Load transaction data
    Retailer->>ProductAssociation: Initialize with DataFrame
    ProductAssociation->>Metrics: Calculate support, confidence, uplift
    Metrics-->>ProductAssociation: Return calculated metrics
    ProductAssociation-->>Retailer: Provide insights on product associations

🐇 In the land of retail, where sales are the quest,
A new tool has arrived, it’s simply the best!
With rules of association, insights take flight,
Cross-selling and more, making shopping a delight!
So hop to your data, let metrics unfold,
With every new purchase, let stories be told! 🌟

codiumai-pr-agent-pro[bot] commented 4 months ago

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪

🧪 PR contains tests

🔒 No security concerns identified

⚡ Key issues to review

Possible Bug
The method `_calc_association` uses a complex series of operations and checks that could be simplified or broken down into smaller, more manageable functions. This would improve readability and maintainability.

Performance Concern
The method `_calc_association` could potentially handle large datasets inefficiently due to the use of dense operations like `toarray()` on sparse matrices. Consider optimizing these operations or exploring more efficient data structures.
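One way to act on the performance concern is to keep the pairwise counts sparse from end to end. The sketch below is an illustration only: the body of `_calc_association` is not shown in this thread, so this is not the module's implementation, and it covers pairs only. It builds a transaction-by-product indicator matrix from the PR's sample fixture values and obtains co-occurrence counts from the sparse product `X.T @ X`, without calling `toarray()`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sample data with the same values as the PR's test fixture (transaction ids shifted to 0-based).
tx = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4]
items = ["milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
         "milk", "bread", "butter", "eggs", "fruit", "bread"]

# Map each product name to a column index.
names = sorted(set(items))
code = {name: i for i, name in enumerate(names)}
cols = [code[item] for item in items]

# Rows are transactions, columns are products, entries are 1 where purchased.
X = csr_matrix(
    (np.ones(len(tx), dtype=np.int64), (tx, cols)),
    shape=(max(tx) + 1, len(names)),
)

# Sparse matrix product: entry (i, j) counts transactions containing both
# product i and product j. The result stays sparse.
cooc = (X.T @ X).tocsr()

# Per-product occurrence counts; summing a sparse matrix yields a small dense vector.
occurrences = np.asarray(X.sum(axis=0)).ravel()

print(dict(zip(names, occurrences.tolist())))
print("bread & milk:", cooc[code["bread"], code["milk"]])
```

Only product pairs that actually co-occur are stored, so memory grows with the number of observed pairs rather than with the square of the catalogue size. The diagonal of `cooc` repeats the per-product occurrence counts.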

codiumai-pr-agent-pro[bot] commented 4 months ago

PR Code Suggestions ✨

Typo

✅ Correct a typo in the documentation text

Consider adding a space between 'of' and 'effective' to correct the typo in the text.

[docs/analysis_modules.md [116]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R116-R116)

```diff
-Marketing and promotions: Association rules can guide the creation ofeffective bundle offers and promotional campaigns.
+Marketing and promotions: Association rules can guide the creation of effective bundle offers and promotional campaigns.
```

`[Suggestion has been applied]`

Suggestion importance[1-10]: 10

Why: The suggestion corrects a clear typo, improving the readability and professionalism of the documentation.

Enhancement

Add explanations for the columns in the example table to aid reader comprehension

Add a brief explanation of the example table columns to enhance understanding for readers unfamiliar with the terms used.

[docs/analysis_modules.md [144]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R144-R144)

```diff
 | product_name_1 | product_name_2 | occurrences_1 | occurrences_2 | cooccurrences | support | confidence | uplift |
+
```

Suggestion importance[1-10]: 9

Why: Adding explanations for the table columns significantly improves the comprehensibility of the example for readers unfamiliar with the terms, enhancing the documentation's utility.

Enhancement

Improve flexibility by using a variable for the file path

Replace the hardcoded file path with a variable that can be set at the top of the notebook. This makes the notebook more flexible and easier to use in different environments without modifying the code cells that load data.

[docs/examples/product_association.ipynb [219]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR219-R219)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+data_file_path = "../../data/transactions.parquet"  # Set the path to the data file at the top of the notebook
+df = pd.read_parquet(data_file_path)
```

Suggestion importance[1-10]: 7

Why: Using a variable for the file path increases the flexibility and reusability of the notebook, making it easier to adapt to different environments. However, it is a minor enhancement and not crucial for functionality.

Enhancement

Use loops to generate DataFrame to reduce code repetition and enhance clarity

Use a loop to generate the DataFrame to avoid repetition and improve code clarity.

[tests/test_product_association.py [27-37]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR27-R37)

```diff
-return pd.DataFrame({
-    "product_1": [
-        "beer", "bread", "bread", "bread", "bread", "butter", "butter", "butter", "butter", "diapers",
-        "eggs", "eggs", "eggs", "eggs", "fruit", "fruit", "fruit", "fruit", "milk", "milk", "milk",
-        "milk",
-    ],
-    "product_2": [
-        "diapers", "butter", "eggs", "fruit", "milk", "bread", "eggs", "fruit", "milk", "beer", "bread",
-        "butter", "fruit", "milk", "bread", "butter", "eggs", "milk", "bread", "butter", "eggs",
-        "fruit",
-    ],
-    ...
-})
+products = ["beer", "bread", "butter", "diapers", "eggs", "fruit", "milk"]
+data = {"product_1": [], "product_2": []}
+for p1 in products:
+    for p2 in products:
+        if p1 != p2:
+            data["product_1"].append(p1)
+            data["product_2"].append(p2)
+return pd.DataFrame(data)
```

Suggestion importance[1-10]: 3

Why: The suggestion to use loops for generating the DataFrame reduces repetition but oversimplifies the data structure, potentially losing the specific test cases intended by the hardcoded values. The original code provides explicit test data which is crucial for testing specific scenarios.

Possible bug

Add a check to ensure the group and value columns are not the same to avoid logical errors in processing

Consider adding a check to ensure that the `value_col` and `group_col` are not the same. This is important because if both columns are the same, it would lead to incorrect calculations of associations, as the same column would be used to identify both the product and the transaction/customer, which is logically incorrect and could lead to misleading results.

[pyretailscience/product_association.py [132]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R132-R132)

```diff
 required_cols = [group_col, value_col]
+if group_col == value_col:
+    raise ValueError("The group column and value column must be different.")
```

Suggestion importance[1-10]: 9

Why: This suggestion addresses a potential logical error that could lead to incorrect calculations of associations, which is crucial for the accuracy of the analysis.

Robustness

Add error handling to the data loading process

Add error handling for the data loading process to manage cases where the file might not exist or is corrupted, enhancing the robustness of the notebook.

[docs/examples/product_association.ipynb [219]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR219-R219)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+try:
+    df = pd.read_parquet("../../data/transactions.parquet")
+except Exception as e:
+    print(f"An error occurred while loading the data: {e}")
```

Suggestion importance[1-10]: 9

Why: Adding error handling significantly improves the robustness of the notebook by managing cases where the file might not exist or is corrupted. This is a crucial enhancement for reliability.

Maintainability

Encapsulate product association logic into a function for better reusability and testability

Consider using a function to encapsulate the logic for generating product association rules, which can then be reused and tested more easily.

[docs/examples/product_association.ipynb [374-381]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR374-R381)

```diff
-from pyretailscience.product_association import ProductAssociation
+def generate_product_association(df):
+    from pyretailscience.product_association import ProductAssociation
+    pa = ProductAssociation(
+        df,
+        value_col="product_name",
+        group_col="transaction_id",
+    )
+    return pa.df.head()
-pa = ProductAssociation(
-    df,
-    value_col="product_name",
-    group_col="transaction_id",
-)
-pa.df.head()
+# Example usage:
+association_df = generate_product_association(df)
+print(association_df)
```

Suggestion importance[1-10]: 8

Why: Encapsulating the logic into a function improves code maintainability and reusability, making it easier to test and extend. This is a valuable improvement for long-term code management.

Maintainability

Improve variable naming for better code readability

Consider using a more descriptive variable name instead of 'df' to improve code readability and maintainability.

[docs/analysis_modules.md [136-140]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R136-R140)

```diff
 pa = ProductAssociation(
-    df,
+    transaction_data,
     value_col="product_name",
     group_col="transaction_id",
 )
```

Suggestion importance[1-10]: 7

Why: Using a more descriptive variable name enhances code readability and maintainability, though it is a minor improvement.

Maintainability

Refactor the `_calc_association` method to improve readability and maintainability

To enhance code readability and maintainability, consider refactoring the large `_calc_association` method by extracting parts of the logic into smaller, more focused methods. For example, the logic for calculating occurrences and probabilities could be moved into a separate method.

[pyretailscience/product_association.py [156-213]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R156-R213)

```diff
+def _calc_occurrences_and_probabilities(sparse_matrix, row_count):
+    occurrences = np.array(sparse_matrix.sum(axis=0)).flatten()
+    occurence_prob = occurrences / row_count
+    return occurrences, occurence_prob
+
 def _calc_association(
     df: pd.DataFrame,
     value_col: str,
     group_col: str = "customer_id",
     target_item: str | None = None,
     number_of_combinations: Literal[2, 3] = 2,
     min_occurrences: int = 1,
     min_cooccurrences: int = 1,
     min_support: float = 0.0,
     min_confidence: float = 0.0,
     min_uplift: float = 0.0,
 ) -> pd.DataFrame:
+    # Existing code with calls to the new method where appropriate
```

Suggestion importance[1-10]: 6

Why: This suggestion enhances code readability and maintainability by breaking down a large method into smaller, more focused methods, which is beneficial for long-term maintenance.

Maintainability

Improve maintainability by using a fixture for sample data

Replace the hardcoded DataFrame creation with a fixture function to improve maintainability and reusability.

[tests/test_product_association.py [16-20]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR16-R20)

```diff
-return pd.DataFrame({
-    "transaction_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5],
-    "product": ["milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
-                "milk", "bread", "butter", "eggs", "fruit", "bread"],
-})
+return self.sample_transactions_df()
```

Suggestion importance[1-10]: 4

Why: While using a fixture function can improve maintainability, the suggestion does not provide the implementation of `self.sample_transactions_df()`, making it unclear how it would be integrated. Additionally, the current hardcoded DataFrame is simple and clear enough for the test context.

Maintainability

Enhance test isolation and reusability by using a helper function for DataFrame creation

Refactor the DataFrame creation to use a helper function for generating test data, enhancing test isolation and reusability.

[tests/test_product_association.py [64-71]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR64-R71)

```diff
-return pd.DataFrame({
-    "product_1": [
-        ("bread", "butter"), ("bread", "butter"), ("bread", "butter"), ("bread", "eggs"), ("bread", "eggs"),
-        ("bread", "eggs"), ("bread", "fruit"), ("bread", "fruit"), ("bread", "fruit"), ("bread", "milk"),
-        ("bread", "milk"), ("bread", "milk"), ("butter", "eggs"), ("butter", "eggs"), ("butter", "eggs"),
-        ("butter", "fruit"), ("butter", "fruit"), ("butter", "fruit"), ("butter", "milk"),
-        ("butter", "milk"), ("butter", "milk"), ("eggs", "fruit"), ("eggs", "fruit"), ("eggs", "fruit"),
-        ("eggs", "milk"), ("eggs", "milk"), ("eggs", "milk"), ("fruit", "milk"), ("fruit", "milk"),
-        ("fruit", "milk"),
-    ],
-    ...
-})
+return self.generate_pair_items_df()
```

Suggestion importance[1-10]: 4

Why: Similar to the first suggestion, using a helper function can improve maintainability, but the suggestion lacks the implementation details of `self.generate_pair_items_df()`. The current hardcoded DataFrame is clear and specific for the test cases.

Performance

Use `coo_matrix` for efficient sparse matrix creation and convert to `csr_matrix` if necessary

To improve the efficiency of the sparse matrix creation, consider using the `coo_matrix` instead of `csr_matrix` for the initial creation, as `coo_matrix` is more efficient for constructing matrices incrementally. This can be converted to `csr_matrix` afterwards if needed for further operations that require fast row slicing.

[pyretailscience/product_association.py [231-238]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R231-R238)

```diff
-sparse_matrix = csr_matrix(
+from scipy.sparse import coo_matrix
+sparse_matrix = coo_matrix(
     (
         [1] * len(unique_combo_df),
         (
             unique_combo_df[group_col].cat.codes,
             unique_combo_df[value_col].cat.codes,
         ),
     ),
-)
+).tocsr()
```

Suggestion importance[1-10]: 7

Why: This suggestion improves performance by using a more efficient matrix construction method, which is beneficial but not critical for correctness.

Best practice

Improve variable naming for clarity and maintainability

Use more descriptive variable names in the print statements to enhance code readability and maintainability.

[docs/examples/product_association.ipynb [238-239]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR238-R239)

```diff
-print(f"Number of unique customers: {df['customer_id'].nunique()}")
-print(f"Number of unique transactions: {df['transaction_id'].nunique()}")
+num_unique_customers = df['customer_id'].nunique()
+num_unique_transactions = df['transaction_id'].nunique()
+print(f"Number of unique customers: {num_unique_customers}")
+print(f"Number of unique transactions: {num_unique_transactions}")
```

Suggestion importance[1-10]: 6

Why: Using more descriptive variable names enhances code readability and maintainability. This is a good practice but is a minor improvement in terms of overall impact.

Best practice

Use list comprehensions for creating DataFrame columns to make the code more concise

Use list comprehensions for more concise and Pythonic code when creating DataFrame columns.

[tests/test_product_association.py [39-40]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR39-R40)

```diff
+num_items = 22  # Adjust as necessary
 return pd.DataFrame({
-    "occurrences_1": [1, 3, 3, 3, 3, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2],
-    "occurrences_2": [1, 2, 2, 3, 2, 3, 2, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3],
+    "occurrences_1": [random.randint(1, 3) for _ in range(num_items)],
+    "occurrences_2": [random.randint(1, 3) for _ in range(num_items)],
     ...
 })
```

Suggestion importance[1-10]: 2

Why: Using list comprehensions with random values does not preserve the specific test cases intended by the hardcoded values. The original explicit values are necessary for ensuring the tests cover the expected scenarios accurately.
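The "Possible bug" suggestion in the list above (rejecting identical group and value columns) can be sketched as a small standalone check with a matching test. `validate_columns` is a hypothetical helper written for illustration; the thread only shows the `required_cols = [group_col, value_col]` line of the module's validation, so the missing-column check and its error message here are assumptions.

```python
def validate_columns(available_cols, group_col, value_col):
    """Hypothetical helper: validate the column arguments before computing rules."""
    # Using one column as both the basket identifier and the product would make
    # every "basket" contain exactly one item, so no pair could ever co-occur.
    if group_col == value_col:
        raise ValueError("The group column and value column must be different.")

    # Assumed check: both columns must exist in the input data.
    missing = [col for col in (group_col, value_col) if col not in available_cols]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


def raises_value_error(*args):
    """Return True if validate_columns rejects the given arguments."""
    try:
        validate_columns(*args)
    except ValueError:
        return True
    return False


columns = ["transaction_id", "product_name"]
print(raises_value_error(columns, "transaction_id", "product_name"))  # valid input
print(raises_value_error(columns, "product_name", "product_name"))    # same column twice
```

Raising early with a specific message keeps the failure close to the cause, rather than letting the calculation produce an empty or misleading result.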

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 88.70968% with 7 lines in your changes missing coverage. Please review.

Files	Patch %	Lines
pyretailscience/product_association.py	88.70%	6 Missing and 1 partial :warning:

Flag	Coverage Δ
service	`?`

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
pyretailscience/product_association.py	`88.70% <88.70%> (ø)`

... and 8 files with indirect coverage changes

Category	Suggestion	Score
Typo	✅ Correct a typo in the documentation text ___ Consider adding a space between 'of' and 'effective' to correct the typo in the text. [docs/analysis_modules.md [116]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R116-R116) ```diff -Marketing and promotions: Association rules can guide the creation ofeffective bundle offers and promotional campaigns. +Marketing and promotions: Association rules can guide the creation of effective bundle offers and promotional campaigns. ``` `[Suggestion has been applied]` Suggestion importance[1-10]: 10 Why: The suggestion corrects a clear typo, improving the readability and professionalism of the documentation.	10
Enhancement	Add explanations for the columns in the example table to aid reader comprehension ___ Add a brief explanation of the example table columns to enhance understanding for readers unfamiliar with the terms used. [docs/analysis_modules.md [144]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R144-R144) ```diff \| product_name_1 \| product_name_2 \| occurrences_1 \| occurrences_2 \| cooccurrences \| support \| confidence \| uplift \| + ``` - [ ] Apply this suggestion Suggestion importance[1-10]: 9 Why: Adding explanations for the table columns significantly improves the comprehensibility of the example for readers unfamiliar with the terms, enhancing the documentation's utility.	9
	Improve flexibility by using a variable for the file path ___ Replace the hardcoded file path with a variable that can be set at the top of the notebook. This makes the notebook more flexible and easier to use in different environments without modifying the code cells that load data. [docs/examples/product_association.ipynb [219]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR219-R219) ```diff -df = pd.read_parquet("../../data/transactions.parquet") +data_file_path = "../../data/transactions.parquet" # Set the path to the data file at the top of the notebook +df = pd.read_parquet(data_file_path) ``` - [ ] Apply this suggestion Suggestion importance[1-10]: 7 Why: Using a variable for the file path increases the flexibility and reusability of the notebook, making it easier to adapt to different environments. However, it is a minor enhancement and not crucial for functionality.	7
	Use loops to generate DataFrame to reduce code repetition and enhance clarity ___ Use a loop to generate the DataFrame to avoid repetition and improve code clarity. [tests/test_product_association.py [27-37]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR27-R37) ```diff -return pd.DataFrame({ - "product_1": [ - "beer", "bread", "bread", "bread", "bread", "butter", "butter", "butter", "butter", "diapers", - "eggs", "eggs", "eggs", "eggs", "fruit", "fruit", "fruit", "fruit", "milk", "milk", "milk", - "milk", - ], - "product_2": [ - "diapers", "butter", "eggs", "fruit", "milk", "bread", "eggs", "fruit", "milk", "beer", "bread", - "butter", "fruit", "milk", "bread", "butter", "eggs", "milk", "bread", "butter", "eggs", - "fruit", - ], - ... -}) +products = ["beer", "bread", "butter", "diapers", "eggs", "fruit", "milk"] +data = {"product_1": [], "product_2": []} +for p1 in products: + for p2 in products: + if p1 != p2: + data["product_1"].append(p1) + data["product_2"].append(p2) +return pd.DataFrame(data) ``` Suggestion importance[1-10]: 3 Why: The suggestion to use loops for generating the DataFrame reduces repetition but oversimplifies the data structure, potentially losing the specific test cases intended by the hardcoded values. The original code provides explicit test data which is crucial for testing specific scenarios.	3
Possible bug	Add a check to ensure the group and value columns are not the same to avoid logical errors in processing ___ Consider adding a check to ensure that the `value_col` and `group_col` are not the same. This is important because if both columns are the same, it would lead to incorrect calculations of associations, as the same column would be used to identify both the product and the transaction/customer, which is logically incorrect and could lead to misleading results. [pyretailscience/product_association.py [132]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R132-R132) ```diff required_cols = [group_col, value_col] +if group_col == value_col: + raise ValueError("The group column and value column must be different.") ``` - [ ] Apply this suggestion Suggestion importance[1-10]: 9 Why: This suggestion addresses a potential logical error that could lead to incorrect calculations of associations, which is crucial for the accuracy of the analysis.	9
Robustness	Add error handling to the data loading process ___ Add error handling for the data loading process to manage cases where the file might not exist or is corrupted, enhancing the robustness of the notebook. [docs/examples/product_association.ipynb [219]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR219-R219) ```diff -df = pd.read_parquet("../../data/transactions.parquet") +try: + df = pd.read_parquet("../../data/transactions.parquet") +except Exception as e: + print(f"An error occurred while loading the data: {e}") ``` - [ ] Apply this suggestion Suggestion importance[1-10]: 9 Why: Adding error handling significantly improves the robustness of the notebook by managing cases where the file might not exist or is corrupted. This is a crucial enhancement for reliability.	9
Maintainability	Encapsulate product association logic into a function for better reusability and testability ___ Consider using a function to encapsulate the logic for generating product association rules, which can then be reused and tested more easily. [docs/examples/product_association.ipynb [374-381]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR374-R381) ```diff -from pyretailscience.product_association import ProductAssociation +def generate_product_association(df): + from pyretailscience.product_association import ProductAssociation + pa = ProductAssociation( + df, + value_col="product_name", + group_col="transaction_id", + ) + return pa.df.head() -pa = ProductAssociation( - df, - value_col="product_name", - group_col="transaction_id", -) -pa.df.head() +# Example usage: +association_df = generate_product_association(df) +print(association_df) ``` - [ ] Apply this suggestion Suggestion importance[1-10]: 8 Why: Encapsulating the logic into a function improves code maintainability and reusability, making it easier to test and extend. This is a valuable improvement for long-term code management.	8
Maintainability	Improve variable naming for better code readability

Consider using a more descriptive variable name instead of 'df' to improve code readability and maintainability.

[docs/analysis_modules.md [136-140]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-3dc6510be3b4cf4266ad054e6ce79b1e63a4a65c6199c3e3b5eb62fc2c457419R136-R140)

```diff
 pa = ProductAssociation(
-    df,
+    transaction_data,
     value_col="product_name",
     group_col="transaction_id",
 )
```

Suggestion importance[1-10]: 7

Why: Using a more descriptive variable name enhances code readability and maintainability, though it is a minor improvement.
Maintainability	Refactor the `_calc_association` method to improve readability and maintainability

To enhance code readability and maintainability, consider refactoring the large `_calc_association` method by extracting parts of the logic into smaller, more focused methods. For example, the logic for calculating occurrences and probabilities could be moved into a separate method.

[pyretailscience/product_association.py [156-213]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R156-R213)

```diff
+def _calc_occurrences_and_probabilities(sparse_matrix, row_count):
+    occurrences = np.array(sparse_matrix.sum(axis=0)).flatten()
+    occurence_prob = occurrences / row_count
+    return occurrences, occurence_prob
+
 def _calc_association(
     df: pd.DataFrame,
     value_col: str,
     group_col: str = "customer_id",
     target_item: str | None = None,
     number_of_combinations: Literal[2, 3] = 2,
     min_occurrences: int = 1,
     min_cooccurrences: int = 1,
     min_support: float = 0.0,
     min_confidence: float = 0.0,
     min_uplift: float = 0.0,
 ) -> pd.DataFrame:
+    # Existing code with calls to the new method where appropriate
```

Suggestion importance[1-10]: 6

Why: This suggestion enhances code readability and maintainability by breaking down a large method into smaller, more focused methods, which is beneficial for long-term maintenance.
Maintainability	Improve maintainability by using a fixture for sample data

Replace the hardcoded DataFrame creation with a fixture function to improve maintainability and reusability.

[tests/test_product_association.py [16-20]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR16-R20)

```diff
-return pd.DataFrame({
-    "transaction_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5],
-    "product": ["milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
-                "milk", "bread", "butter", "eggs", "fruit", "bread"],
-})
+return self.sample_transactions_df()
```

Suggestion importance[1-10]: 4

Why: While using a fixture function can improve maintainability, the suggestion does not provide the implementation of `self.sample_transactions_df()`, making it unclear how it would be integrated. Additionally, the current hardcoded DataFrame is simple and clear enough for the test context.
Maintainability	Enhance test isolation and reusability by using a helper function for DataFrame creation

Refactor the DataFrame creation to use a helper function for generating test data, enhancing test isolation and reusability.

[tests/test_product_association.py [64-71]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR64-R71)

```diff
-return pd.DataFrame({
-    "product_1": [
-        ("bread", "butter"), ("bread", "butter"), ("bread", "butter"), ("bread", "eggs"), ("bread", "eggs"),
-        ("bread", "eggs"), ("bread", "fruit"), ("bread", "fruit"), ("bread", "fruit"), ("bread", "milk"),
-        ("bread", "milk"), ("bread", "milk"), ("butter", "eggs"), ("butter", "eggs"), ("butter", "eggs"),
-        ("butter", "fruit"), ("butter", "fruit"), ("butter", "fruit"), ("butter", "milk"),
-        ("butter", "milk"), ("butter", "milk"), ("eggs", "fruit"), ("eggs", "fruit"), ("eggs", "fruit"),
-        ("eggs", "milk"), ("eggs", "milk"), ("eggs", "milk"), ("fruit", "milk"), ("fruit", "milk"),
-        ("fruit", "milk"),
-    ],
-    ...
-})
+return self.generate_pair_items_df()
```

Suggestion importance[1-10]: 4

Why: Similar to the first suggestion, using a helper function can improve maintainability, but the suggestion lacks the implementation details of `self.generate_pair_items_df()`. The current hardcoded DataFrame is clear and specific for the test cases.
Performance	Use `coo_matrix` for efficient sparse matrix creation and convert to `csr_matrix` if necessary

To improve the efficiency of the sparse matrix creation, consider using `coo_matrix` instead of `csr_matrix` for the initial creation, as `coo_matrix` is more efficient for constructing matrices incrementally. It can be converted to `csr_matrix` afterwards if needed for further operations that require fast row slicing.

[pyretailscience/product_association.py [231-238]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-34242b6721500622d77f1e4153020619582b46985c1d0b01411c4c2400b95cb7R231-R238)

```diff
-sparse_matrix = csr_matrix(
+from scipy.sparse import coo_matrix
+sparse_matrix = coo_matrix(
     (
         [1] * len(unique_combo_df),
         (
             unique_combo_df[group_col].cat.codes,
             unique_combo_df[value_col].cat.codes,
         ),
     ),
-)
+).tocsr()
```

Suggestion importance[1-10]: 7

Why: This suggestion improves performance by using a more efficient matrix construction method, which is beneficial but not critical for correctness.
Best practice	Improve variable naming for clarity and maintainability

Use more descriptive variable names in the print statements to enhance code readability and maintainability.

[docs/examples/product_association.ipynb [238-239]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-8db64798c05c605c036db9aca423a51caf447ca68d5bfb3911687457fbf2b1cdR238-R239)

```diff
-print(f"Number of unique customers: {df['customer_id'].nunique()}")
-print(f"Number of unique transactions: {df['transaction_id'].nunique()}")
+num_unique_customers = df['customer_id'].nunique()
+num_unique_transactions = df['transaction_id'].nunique()
+print(f"Number of unique customers: {num_unique_customers}")
+print(f"Number of unique transactions: {num_unique_transactions}")
```

Suggestion importance[1-10]: 6

Why: Using more descriptive variable names enhances code readability and maintainability. This is a good practice but is a minor improvement in terms of overall impact.
Best practice	Use list comprehensions for creating DataFrame columns to make the code more concise

Use list comprehensions for more concise and Pythonic code when creating DataFrame columns.

[tests/test_product_association.py [39-40]](https://github.com/Data-Simply/pyretailscience/pull/69/files#diff-c57730f5b3bb551a6f2032f013fa8a09ea293e9451b39178e75824985aa82cedR39-R40)

```diff
+num_items = 22  # Adjust as necessary
 return pd.DataFrame({
-    "occurrences_1": [1, 3, 3, 3, 3, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2],
-    "occurrences_2": [1, 2, 2, 3, 2, 3, 2, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3],
+    "occurrences_1": [random.randint(1, 3) for _ in range(num_items)],
+    "occurrences_2": [random.randint(1, 3) for _ in range(num_items)],
     ...
 })
```

Suggestion importance[1-10]: 2

Why: Using list comprehensions with random values does not preserve the specific test cases intended by the hardcoded values. The original explicit values are necessary for ensuring the tests cover the expected scenarios accurately.
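
A note on the error-handling suggestion for the notebook's data loading step: as proposed, the `except` branch only prints a message, so `df` stays undefined and later cells would fail with a `NameError`. A minimal sketch of an alternative that fails early with a clearer message is shown here. The helper name `load_transactions` and the tuple of caught exception types are assumptions for illustration and are not part of the PR; which exceptions a corrupted file raises depends on the parquet engine in use.

```python
from pathlib import Path

import pandas as pd


def load_transactions(path: str) -> pd.DataFrame:
    """Load a parquet file, raising a descriptive error if it cannot be read.

    Hypothetical helper, not part of pyretailscience.
    """
    file_path = Path(path)
    if not file_path.is_file():
        # Fail fast rather than printing and continuing without a DataFrame.
        raise FileNotFoundError(f"Transactions file not found: {file_path}")
    try:
        return pd.read_parquet(file_path)
    except (OSError, ValueError) as exc:
        # Assumed exception types for an unreadable or corrupted file;
        # the original error is kept as the cause for debugging.
        raise RuntimeError(f"Could not read parquet file: {file_path}") from exc
```

In the notebook, this would replace the bare call, for example `df = load_transactions("../../data/transactions.parquet")`.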

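To make the occurrence and probability step from the `_calc_association` refactoring suggestion concrete, here is a plain-Python sketch applied to the sample transactions quoted from the test fixture. For each product it counts the transactions that contain it and divides by the number of transactions, which is what the column sums of a one-hot group-by-item matrix divided by `row_count` amount to. The function name `occurrence_probabilities` is hypothetical, and the sketch assumes `row_count` is the number of distinct transactions.

```python
from collections import Counter

# Sample data copied from the test fixture quoted in the review.
transaction_ids = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5]
products = ["milk", "bread", "fruit", "butter", "eggs", "fruit", "beer", "diapers",
            "milk", "bread", "butter", "eggs", "fruit", "bread"]


def occurrence_probabilities(groups, values):
    """Return (occurrences, probabilities) per value, counting each group once."""
    # De-duplicate (group, value) pairs so repeats within a group count once.
    unique_pairs = set(zip(groups, values))
    occurrences = Counter(value for _, value in unique_pairs)
    row_count = len(set(groups))  # assumed: one row per distinct transaction
    probabilities = {value: count / row_count for value, count in occurrences.items()}
    return occurrences, probabilities


occurrences, probabilities = occurrence_probabilities(transaction_ids, products)
print(occurrences["bread"], probabilities["bread"])  # 3 0.6
```

Bread appears in transactions 1, 4, and 5 out of five transactions, hence 3 occurrences and a probability of 0.6.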