Data-Simply / pyretailscience

pyretailscience - A data analysis and science toolkit for retail data

refactor: move data simulation to another package #55

Closed mvanwyk closed 5 months ago

mvanwyk commented 5 months ago

PR Type

Enhancement, Documentation


Description


Changes walkthrough 📝

Relevant files

Enhancement

data_contracts.ipynb (docs/examples/data_contracts.ipynb): Refactor data contracts example to load data from parquet file
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Improved exception handling in the top_customers function.
  • Added type annotations and docstrings for better clarity.
  +101/-110

retention.ipynb (docs/examples/retention.ipynb): Refactor retention example to load data from parquet file
  • Removed data simulation setup.
  • Added data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.
  +173/-33

gain_loss.ipynb (docs/examples/gain_loss.ipynb): Refactor gain/loss example to load data from parquet file
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.
  +66/-41

cross_shop.ipynb (docs/examples/cross_shop.ipynb): Refactor cross-shop example to load data from parquet file
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.
  +67/-42

Documentation

README.md: Update README to remove data simulation instructions
  • Removed section on generating simulated data.
  • Added placeholder text for future updates.
  +1/-27

Configuration changes

mkdocs.yml: Update mkdocs configuration to reflect new examples structure
  • Reorganized examples section.
  • Removed reference to the data simulation example.
  +1/-3

Additional files (token-limit)

segmentation.ipynb (docs/examples/segmentation.ipynb): ...
  +294/-269
💡 PR-Agent usage: Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Summary by CodeRabbit

    coderabbitai[bot] commented 5 months ago

    Walkthrough

    The recent changes primarily focus on shifting the data workflow from generating simulated data to loading pre-simulated data. This impacts multiple notebooks and documentation files, altering the instructions and examples accordingly. Additionally, the .gitignore file was updated to exclude .csv instead of .parquet files, and the pyproject.toml was modified to remove certain dependencies and script entries. Navigation in mkdocs.yml was also restructured for better clarity and organization.
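For context, the pattern the notebooks now share is a single load step over the pre-simulated dataset rather than per-notebook simulation code. A minimal sketch, assuming the relative path referenced in the example notebooks:

```python
import pandas as pd

# Load the pre-simulated transaction data instead of generating it in the
# notebook (path as referenced in docs/examples/*.ipynb; adjust for your
# working directory / checkout layout).
df = pd.read_parquet("../../data/transactions.parquet")
df.head()
```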

Changes

| Files | Change Summary |
|---|---|
| .gitignore | Updated to exclude `.csv` instead of `*.parquet` files. |
| README.md | Removed the section on generating simulated transaction data; replaced with "Coming Soon." |
| docs/examples/cross_shop.ipynb | Changed from simulating data to loading pre-simulated data; updated displayed data. |
| docs/examples/data_contracts.ipynb | Updated text and functionality for loading data; added a new class and type hints in function parameters. |
| docs/examples/gain_loss.ipynb | Switched from simulating to loading pre-simulated data; updated brand names and prices. |
| docs/examples/retention.ipynb | Significant changes to load data from a file; included new imports and updated output visualizations. |
| …/examples/… (multiple files) | Grouped similar changes across multiple notebook files for brevity. |
| mkdocs.yml | Rearranged the navigation structure; removed outdated sections and links. |
| pyproject.toml | Removed the `click` dependency; reordered some package versions; removed a script entry. |

    Poem

    In the realm where data flows,
    Files transformed and notebooks glowed,
    From simulating days to pre-simulated ways,
    Cleaner paths now boldly showed.
    CSVs we shall hide,
    In structured lines, our progress pried.
🌟🚀 A celebratory leap, with code we keep! 🚀🌟


    codiumai-pr-agent-pro[bot] commented 5 months ago

PR Reviewer Guide 🔍

    โฑ๏ธ Estimated effort to review: 3 ๐Ÿ”ต๐Ÿ”ต๐Ÿ”ตโšชโšช
    ๐Ÿงช No relevant tests
    ๐Ÿ”’ No security concerns identified
    โšก Key issues to review

**Data Consistency:** Ensure that the new data source (parquet files) maintains consistency with the previously simulated data, especially in terms of data structure and content.

**Exception Handling:** Review the changes to exception handling in `data_contracts.ipynb` to ensure they are appropriate and provide clear error messages.

**Documentation Updates:** Verify that all documentation and comments accurately reflect the changes made, especially in the Jupyter notebooks and the README file.
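To make the exception-handling item concrete, here is a minimal sketch of what the revised `top_customers` helper appears to look like, pieced together from the diffs quoted in the suggestions below; the docstring, the column check, and the exact error message are illustrative assumptions, not copied from the notebook:

```python
import pandas as pd


def top_customers(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the top `n` rows of the transactions dataframe by total spend.

    Note: this docstring and the validation below are illustrative; see the
    notebook for the actual wording.
    """
    if "total_price" not in df.columns:
        # Assumed validation step: fail fast with a clear message when the
        # expected column is missing.
        msg = "Expected a 'total_price' column in the transactions dataframe"
        raise ValueError(msg)
    return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)
```

The reviewer guide is essentially asking whether this kind of validation, and the message it raises, is clear and appropriate.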
    codiumai-pr-agent-pro[bot] commented 5 months ago

PR Code Suggestions ✨

Possible bug

**Correct the case sensitivity in the DataFrame type hint**

Replace the use of `pd.Dataframe` with `pd.DataFrame` to correct the case-sensitivity issue in the type hint, which could lead to runtime errors or issues with static type checkers.

[docs/examples/data_contracts.ipynb [812]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R812-R812)

```diff
-def top_customers(df: pd.Dataframe, n: int=5) -> pd.DataFrame:
+def top_customers(df: pd.DataFrame, n: int=5) -> pd.DataFrame:
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 10. Why: The correction from `pd.Dataframe` to `pd.DataFrame` is crucial as it prevents potential runtime errors and issues with static type checkers, ensuring the code functions correctly.
Best practice

**Add data validation after loading the dataframe to ensure it contains all expected columns**

It's recommended to validate data loaded from external sources to ensure it meets expected formats and constraints. This can prevent issues arising from malformed or unexpected data.

[docs/examples/segmentation.ipynb [197-198]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R198)

```diff
 df = pd.read_parquet("../../data/transactions.parquet")
+# Ensure the dataframe contains expected columns
+expected_columns = {'transaction_id', 'transaction_datetime', 'customer_id', 'product_id', 'product_name', 'category_0_name', 'category_0_id', 'category_1_name', 'category_1_id', 'brand_name', 'brand_id', 'unit_price', 'quantity', 'total_price', 'store_id'}
+assert expected_columns.issubset(df.columns), "Dataframe is missing one or more expected columns"
 df.head()
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: This suggestion adds a crucial validation step to ensure the data meets expected formats, which can prevent downstream errors due to malformed data.
**Set the random seed outside the function call for consistent outputs**

Ensure that the random seed is set outside the function call for reproducibility. This practice helps maintain consistent outputs for the random choices made in the notebook.

[docs/examples/cross_shop.ipynb [246-248]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR246-R248)

```diff
-df.loc[shoes_idx, "category_1_name"] = np.random.RandomState(42).choice(
+rng = np.random.RandomState(42)
+df.loc[shoes_idx, "category_1_name"] = rng.choice(
     ["Shoes", "Jeans"],
     size=shoes_idx.sum(),
     p=[0.5, 0.5],
 )
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 8. Why: This suggestion ensures reproducibility, which is important for consistent results, especially in a notebook setting.
**Use a custom exception for clearer error handling**

Instead of raising a generic ValueError, raise a more specific custom exception to provide clearer error handling specific to the domain or application.

[docs/examples/data_contracts.ipynb [817]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R817-R817)

```diff
-raise ValueError(msg)
+class ContractValidationError(Exception):
+    pass
+raise ContractValidationError(msg)
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 7. Why: Using a custom exception improves error handling by providing clearer and more specific error messages, which is a best practice for maintainable code.
Robustness

**Add error handling around the file reading operation to manage potential exceptions**

Consider adding error handling for file reading operations to manage exceptions that may occur if the file is missing or corrupt.

[docs/examples/segmentation.ipynb [197]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R197)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+try:
+    df = pd.read_parquet("../../data/transactions.parquet")
+except Exception as e:
+    print(f"Failed to read data: {e}")
+    # Handle the error appropriately, possibly re-raise or log
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: Adding error handling improves the robustness of the code by managing exceptions that may occur during file reading operations, preventing the program from crashing unexpectedly.
Enhancement

**Add a data type expectation for the 'total_price' column**

Ensure that the ExpectationConfiguration for the 'total_price' column includes a check for the column's data type, enhancing data validation and consistency.

[docs/examples/data_contracts.ipynb [895-897]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R895-R897)

```diff
 ExpectationConfiguration(
     expectation_type="expect_column_to_exist",
     kwargs={"column": "total_price"},
 ),
+ExpectationConfiguration(
+    expectation_type="expect_column_values_to_be_of_type",
+    kwargs={"column": "total_price", "type_": "float"},
+),
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: Including a data type expectation for the 'total_price' column enhances data validation and consistency, ensuring that the data meets expected standards.
**Add a check for an empty DataFrame to prevent errors**

Add a check to ensure that the DataFrame `df` is not empty before proceeding with sorting and returning the top customers. This prevents potential errors when operating on an empty DataFrame.

[docs/examples/data_contracts.ipynb [819]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R819-R819)

```diff
+if df.empty:
+    return df
 return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 8. Why: Adding a check for an empty DataFrame enhances the robustness of the function by preventing potential errors when operating on an empty DataFrame.
**Use pandas to_html for dynamic HTML table generation**

Replace the hard-coded HTML table with dynamic generation using the pandas DataFrame `to_html` method, which can be customized with CSS classes and other HTML attributes. This approach enhances code readability and maintainability.

[docs/examples/retention.ipynb [36-148]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R36-R148)

```diff
+df.to_html(classes='dataframe', border=1)
```

Suggestion importance[1-10]: 8. Why: This suggestion enhances code readability and maintainability by leveraging pandas' built-in functionality, reducing the need for hard-coded HTML.
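As a rough usage sketch of the suggestion above: the `classes` and `border` arguments come from the suggested diff, while the toy dataframe and the `IPython.display.HTML` wrapper are assumptions about how the output would be rendered in a notebook cell.

```python
import pandas as pd
from IPython.display import HTML

# Illustrative dataframe standing in for the notebook's transactions data.
df = pd.DataFrame({"customer_id": [1, 2], "total_price": [19.99, 7.50]})

# Render the dataframe as an HTML table tagged with a CSS class instead of
# hand-written markup; styling can then be managed in a stylesheet.
HTML(df.to_html(classes="dataframe", border=1))
```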
    
Possible issue

**Add a check to ensure the DataFrame is not empty to prevent runtime errors**

To ensure that the DataFrame is not empty before performing operations, add a check to confirm that `df` is not empty after loading the data. This check prevents potential errors in subsequent operations if the data file is missing or empty.

[docs/examples/cross_shop.ipynb [195-196]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR195-R196)

```diff
 df = pd.read_parquet("../../data/transactions.parquet")
+if df.empty:
+    raise ValueError("Data file is empty or not found.")
 df.head()
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: This suggestion addresses a potential runtime error, which is crucial for ensuring the robustness of the code.
Maintainability

**Replace hardcoded file paths with environment variables for better flexibility and maintainability**

To avoid hardcoding file paths, consider using a configuration file or environment variables to manage file paths, making the code more flexible and easier to maintain across different environments.

[docs/examples/segmentation.ipynb [197]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R197)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+import os
+data_path = os.getenv('DATA_PATH', '../../data/')
+df = pd.read_parquet(data_path + "transactions.parquet")
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 8. Why: Using environment variables for file paths enhances the flexibility and maintainability of the code, making it easier to adapt to different environments.
**Encapsulate data loading logic into a function for improved readability and reusability**

For better readability and maintenance, consider using a function to encapsulate the data loading logic, especially if similar data loading patterns are used multiple times in the notebook.

[docs/examples/segmentation.ipynb [197-198]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R198)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+def load_data(file_path):
+    return pd.read_parquet(file_path)
+
+df = load_data("../../data/transactions.parquet")
 df.head()
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 7. Why: Encapsulating the data loading logic into a function enhances code readability and reusability, especially if similar patterns are used multiple times in the notebook.
**Replace inline CSS with an external CSS file for DataFrame styling**

Consider using CSS classes instead of inline styles for the DataFrame HTML representation to improve maintainability and separation of concerns. This change will make it easier to manage styles globally and reduce redundancy in the notebook.

[docs/examples/retention.ipynb [23-35]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R23-R35)

Suggestion importance[1-10]: 7. Why: Using an external CSS file improves maintainability and separation of concerns, but it requires additional setup to ensure the CSS file is available and correctly linked.
**Use a variable for the file path to enhance flexibility and maintainability**

Replace the hard-coded file path with a variable that can be set at the top of the notebook. This change makes the notebook more flexible and easier to maintain, especially when the data source changes or when the notebook is used in different environments.

[docs/examples/cross_shop.ipynb [195]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR195-R195)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+data_file_path = "../../data/transactions.parquet"
+df = pd.read_parquet(data_file_path)
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 7. Why: Using a variable for the file path makes the code more flexible and easier to maintain, which is a good practice but not critical.
**Improve variable naming for better readability**

Consider using a more descriptive variable name instead of `shoes_idx` to enhance code readability. For example, `shoes_category_filter` would provide more context about the purpose of the variable.

[docs/examples/cross_shop.ipynb [245]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR245-R245)

```diff
-shoes_idx = df["category_1_name"] == "Shoes"
+shoes_category_filter = df["category_1_name"] == "Shoes"
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 6. Why: The suggestion improves code readability by using a more descriptive variable name, which is beneficial for maintainability but not critical.
Readability

**Improve DataFrame text display formatting in the notebook**

Ensure the DataFrame display in 'text/plain' output is properly formatted for better readability. Consider using pd.set_option to adjust display settings like max_columns, max_rows, or precision.

[docs/examples/retention.ipynb [153-179]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R153-R179)

```diff
-"   transaction_id transaction_datetime  customer_id  product_id \\\n",
-"0            7108  2023-01-12 17:44:29            1          15 \n",
-...
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+df.head()
```

Suggestion importance[1-10]: 6. Why: Adjusting display settings can improve readability, but the current formatting is already fairly readable. This is a minor enhancement.
**Use Python dictionary syntax for arrow properties to enhance readability**

Replace the manually written arrow-properties dictionary with the `dict(...)` constructor syntax, which enhances code readability and maintainability.

[docs/examples/retention.ipynb [311]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R311-R311)

```diff
-"arrowprops={\"facecolor\": \"black\", \"arrowstyle\": \"-|>\", \"connectionstyle\": \"arc3,rad=-0.25\", \"mutation_scale\": 25},\n",
+"arrowprops=dict(facecolor='black', arrowstyle='-|>', connectionstyle='arc3,rad=-0.25', mutation_scale=25),\n",
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 5. Why: The existing code is already quite readable, and this change offers only a slight improvement in readability and maintainability.