Data-Simply / pyretailscience

pyretailscience - A data analysis and science toolkit for retail data

refactor: move data simulation to another package #55

Closed mvanwyk closed 5 months ago

mvanwyk commented 5 months ago

PR Type

Enhancement, Documentation


Description


Changes walkthrough 📝

Relevant files

Enhancement

data_contracts.ipynb (docs/examples/data_contracts.ipynb): Refactor data contracts example to load data from parquet file
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Improved exception handling in the top_customers function.
  • Added type annotations and docstrings for better clarity.
  +101/-110

retention.ipynb (docs/examples/retention.ipynb): Refactor retention example to load data from parquet file
  • Removed data simulation setup.
  • Added data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.
  +173/-33

gain_loss.ipynb (docs/examples/gain_loss.ipynb): Refactor gain/loss example to load data from parquet file
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.
  +66/-41

cross_shop.ipynb (docs/examples/cross_shop.ipynb): Refactor cross-shop example to load data from parquet file
  • Replaced data simulation code with data loading from a parquet file.
  • Updated transaction data examples to reflect the new data.
  • Minor formatting improvements.
  +67/-42

Documentation

README.md: Update README to remove data simulation instructions
  • Removed section on generating simulated data.
  • Added placeholder text for future updates.
  +1/-27

Configuration changes

mkdocs.yml: Update mkdocs configuration to reflect new examples structure
  • Reorganized examples section.
  • Removed reference to the data simulation example.
  +1/-3

Additional files (token-limit)

segmentation.ipynb (docs/examples/segmentation.ipynb): ...
  +294/-269
💡 PR-Agent usage: Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Summary by CodeRabbit

    coderabbitai[bot] commented 5 months ago

    Walkthrough

    The recent changes primarily focus on shifting the data workflow from generating simulated data to loading pre-simulated data. This impacts multiple notebooks and documentation files, altering the instructions and examples accordingly. Additionally, the .gitignore file was updated to exclude .csv instead of .parquet files, and the pyproject.toml was modified to remove certain dependencies and script entries. Navigation in mkdocs.yml was also restructured for better clarity and organization.
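For context, the pattern the notebooks now share is a single load step over the pre-simulated dataset rather than per-notebook simulation code. A minimal sketch, assuming the relative path referenced in the example notebooks:

```python
import pandas as pd

# Load the pre-simulated transaction data instead of generating it in the
# notebook (path as referenced in docs/examples/*.ipynb; adjust for your
# working directory / checkout layout).
df = pd.read_parquet("../../data/transactions.parquet")
df.head()
```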

Changes

| Files | Change Summary |
|---|---|
| .gitignore | Updated to exclude `.csv` instead of `*.parquet` files. |
| README.md | Removed the section on generating simulated transaction data; replaced with "Coming Soon." |
| docs/examples/cross_shop.ipynb | Changed from simulating data to loading pre-simulated data; updated displayed data. |
| docs/examples/data_contracts.ipynb | Updated text and functionality for loading data; added a new class and type hints in function parameters. |
| docs/examples/gain_loss.ipynb | Switched from simulating to loading pre-simulated data; updated brand names and prices. |
| docs/examples/retention.ipynb | Significant changes to load data from a file; included new imports and updated output visualizations. |
| …/examples/… (multiple files) | Grouped similar changes across multiple notebook files for brevity. |
| mkdocs.yml | Rearranged the navigation structure; removed outdated sections and links. |
| pyproject.toml | Removed the `click` dependency; reordered some package versions; removed a script entry. |

    Poem

    In the realm where data flows,
    Files transformed and notebooks glowed,
    From simulating days to pre-simulated ways,
    Cleaner paths now boldly showed.
    CSVs we shall hide,
    In structured lines, our progress pried.
🌟🚀 A celebratory leap, with code we keep! 🚀🌟


    codiumai-pr-agent-pro[bot] commented 5 months ago

PR Reviewer Guide 🔍

    โฑ๏ธ Estimated effort to review: 3 ๐Ÿ”ต๐Ÿ”ต๐Ÿ”ตโšชโšช
    ๐Ÿงช No relevant tests
    ๐Ÿ”’ No security concerns identified
    โšก Key issues to review

**Data Consistency:** Ensure that the new data source (parquet files) maintains consistency with the previously simulated data, especially in terms of data structure and content.

**Exception Handling:** Review the changes to exception handling in `data_contracts.ipynb` to ensure they are appropriate and provide clear error messages.

**Documentation Updates:** Verify that all documentation and comments accurately reflect the changes made, especially in the Jupyter notebooks and the README file.
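To make the exception-handling item concrete, here is a minimal sketch of what the revised `top_customers` helper appears to look like, pieced together from the diffs quoted in the suggestions below; the docstring, the column check, and the exact error message are illustrative assumptions, not copied from the notebook:

```python
import pandas as pd


def top_customers(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the top `n` rows of the transactions dataframe by total spend.

    Note: this docstring and the validation below are illustrative; see the
    notebook for the actual wording.
    """
    if "total_price" not in df.columns:
        # Assumed validation step: fail fast with a clear message when the
        # expected column is missing.
        msg = "Expected a 'total_price' column in the transactions dataframe"
        raise ValueError(msg)
    return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)
```

The reviewer guide is essentially asking whether this kind of validation, and the message it raises, is clear and appropriate.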
    codiumai-pr-agent-pro[bot] commented 5 months ago

PR Code Suggestions ✨

Possible bug

**Correct the case sensitivity in the DataFrame type hint**

Replace the use of `pd.Dataframe` with `pd.DataFrame` to correct the case-sensitivity issue in the type hint, which could lead to runtime errors or issues with static type checkers.

[docs/examples/data_contracts.ipynb [812]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R812-R812)

```diff
-def top_customers(df: pd.Dataframe, n: int=5) -> pd.DataFrame:
+def top_customers(df: pd.DataFrame, n: int=5) -> pd.DataFrame:
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 10. Why: The correction from `pd.Dataframe` to `pd.DataFrame` is crucial as it prevents potential runtime errors and issues with static type checkers, ensuring the code functions correctly.
Best practice

**Add data validation after loading the dataframe to ensure it contains all expected columns**

It's recommended to validate data loaded from external sources to ensure it meets expected formats and constraints. This can prevent issues arising from malformed or unexpected data.

[docs/examples/segmentation.ipynb [197-198]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R198)

```diff
 df = pd.read_parquet("../../data/transactions.parquet")
+# Ensure the dataframe contains expected columns
+expected_columns = {'transaction_id', 'transaction_datetime', 'customer_id', 'product_id', 'product_name', 'category_0_name', 'category_0_id', 'category_1_name', 'category_1_id', 'brand_name', 'brand_id', 'unit_price', 'quantity', 'total_price', 'store_id'}
+assert expected_columns.issubset(df.columns), "Dataframe is missing one or more expected columns"
 df.head()
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: This suggestion adds a crucial validation step to ensure the data meets expected formats, which can prevent downstream errors due to malformed data.
**Set the random seed outside the function call for consistent outputs**

Ensure that the random seed is set outside the function call for reproducibility. This practice helps maintain consistent outputs for the random choices made in the notebook.

[docs/examples/cross_shop.ipynb [246-248]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR246-R248)

```diff
-df.loc[shoes_idx, "category_1_name"] = np.random.RandomState(42).choice(
+rng = np.random.RandomState(42)
+df.loc[shoes_idx, "category_1_name"] = rng.choice(
     ["Shoes", "Jeans"],
     size=shoes_idx.sum(),
     p=[0.5, 0.5],
 )
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 8. Why: This suggestion ensures reproducibility, which is important for consistent results, especially in a notebook setting.
**Use a custom exception for clearer error handling**

Instead of raising a generic ValueError, raise a more specific custom exception to provide clearer error handling specific to the domain or application.

[docs/examples/data_contracts.ipynb [817]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R817-R817)

```diff
-raise ValueError(msg)
+class ContractValidationError(Exception):
+    pass
+raise ContractValidationError(msg)
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 7. Why: Using a custom exception improves error handling by providing clearer and more specific error messages, which is a best practice for maintainable code.
Robustness

**Add error handling around the file reading operation to manage potential exceptions**

Consider adding error handling for file reading operations to manage exceptions that may occur if the file is missing or corrupt.

[docs/examples/segmentation.ipynb [197]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R197)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+try:
+    df = pd.read_parquet("../../data/transactions.parquet")
+except Exception as e:
+    print(f"Failed to read data: {e}")
+    # Handle the error appropriately, possibly re-raise or log
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: Adding error handling improves the robustness of the code by managing exceptions that may occur during file reading operations, preventing the program from crashing unexpectedly.
Enhancement

**Add a data type expectation for the 'total_price' column**

Ensure that the ExpectationConfiguration for the 'total_price' column includes a check for the column's data type, enhancing data validation and consistency.

[docs/examples/data_contracts.ipynb [895-897]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R895-R897)

```diff
 ExpectationConfiguration(
     expectation_type="expect_column_to_exist",
     kwargs={"column": "total_price"},
 ),
+ExpectationConfiguration(
+    expectation_type="expect_column_values_to_be_of_type",
+    kwargs={"column": "total_price", "type_": "float"},
+),
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: Including a data type expectation for the 'total_price' column enhances data validation and consistency, ensuring that the data meets expected standards.
**Add a check for an empty DataFrame to prevent errors**

Add a check to ensure that the DataFrame `df` is not empty before proceeding with sorting and returning the top customers. This prevents potential errors when operating on an empty DataFrame.

[docs/examples/data_contracts.ipynb [819]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R819-R819)

```diff
+if df.empty:
+    return df
 return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 8. Why: Adding a check for an empty DataFrame enhances the robustness of the function by preventing potential errors when operating on an empty DataFrame.
**Use pandas to_html for dynamic HTML table generation**

Replace the hard-coded HTML table with dynamic generation using the pandas DataFrame `to_html` method, which can be customized with CSS classes and other HTML attributes. This approach enhances code readability and maintainability.

[docs/examples/retention.ipynb [36-148]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R36-R148)

```diff
+df.to_html(classes='dataframe', border=1)
```

Suggestion importance[1-10]: 8. Why: This suggestion enhances code readability and maintainability by leveraging pandas' built-in functionality, reducing the need for hard-coded HTML.
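As a rough usage sketch of the suggestion above: the `classes` and `border` arguments come from the suggested diff, while the toy dataframe and the `IPython.display.HTML` wrapper are assumptions about how the output would be rendered in a notebook cell.

```python
import pandas as pd
from IPython.display import HTML

# Illustrative dataframe standing in for the notebook's transactions data.
df = pd.DataFrame({"customer_id": [1, 2], "total_price": [19.99, 7.50]})

# Render the dataframe as an HTML table tagged with a CSS class instead of
# hand-written markup; styling can then be managed in a stylesheet.
HTML(df.to_html(classes="dataframe", border=1))
```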
    
Possible issue

**Add a check to ensure the DataFrame is not empty to prevent runtime errors**

To ensure that the DataFrame is not empty before performing operations, add a check to confirm that `df` is not empty after loading the data. This check prevents potential errors in subsequent operations if the data file is missing or empty.

[docs/examples/cross_shop.ipynb [195-196]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR195-R196)

```diff
 df = pd.read_parquet("../../data/transactions.parquet")
+if df.empty:
+    raise ValueError("Data file is empty or not found.")
 df.head()
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 9. Why: This suggestion addresses a potential runtime error, which is crucial for ensuring the robustness of the code.
Maintainability

**Replace hardcoded file paths with environment variables for better flexibility and maintainability**

To avoid hardcoding file paths, consider using a configuration file or environment variables to manage file paths, making the code more flexible and easier to maintain across different environments.

[docs/examples/segmentation.ipynb [197]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R197)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+import os
+data_path = os.getenv('DATA_PATH', '../../data/')
+df = pd.read_parquet(data_path + "transactions.parquet")
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 8. Why: Using environment variables for file paths enhances the flexibility and maintainability of the code, making it easier to adapt to different environments.
**Encapsulate data loading logic into a function for improved readability and reusability**

For better readability and maintenance, consider using a function to encapsulate the data loading logic, especially if similar data loading patterns are used multiple times in the notebook.

[docs/examples/segmentation.ipynb [197-198]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R198)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+def load_data(file_path):
+    return pd.read_parquet(file_path)
+
+df = load_data("../../data/transactions.parquet")
 df.head()
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 7. Why: Encapsulating the data loading logic into a function enhances code readability and reusability, especially if similar patterns are used multiple times in the notebook.
**Replace inline CSS with an external CSS file for DataFrame styling**

Consider using CSS classes instead of inline styles for the DataFrame HTML representation to improve maintainability and separation of concerns. This change will make it easier to manage styles globally and reduce redundancy in the notebook.

[docs/examples/retention.ipynb [23-35]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R23-R35)

Suggestion importance[1-10]: 7. Why: Using an external CSS file improves maintainability and separation of concerns, but it requires additional setup to ensure the CSS file is available and correctly linked.
**Use a variable for the file path to enhance flexibility and maintainability**

Replace the hard-coded file path with a variable that can be set at the top of the notebook. This change makes the notebook more flexible and easier to maintain, especially when the data source changes or when the notebook is used in different environments.

[docs/examples/cross_shop.ipynb [195]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR195-R195)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+data_file_path = "../../data/transactions.parquet"
+df = pd.read_parquet(data_file_path)
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 7. Why: Using a variable for the file path makes the code more flexible and easier to maintain, which is a good practice but not critical.
**Improve variable naming for better readability**

Consider using a more descriptive variable name instead of `shoes_idx` to enhance code readability. For example, `shoes_category_filter` would provide more context about the purpose of the variable.

[docs/examples/cross_shop.ipynb [245]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR245-R245)

```diff
-shoes_idx = df["category_1_name"] == "Shoes"
+shoes_category_filter = df["category_1_name"] == "Shoes"
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 6. Why: The suggestion improves code readability by using a more descriptive variable name, which is beneficial for maintainability but not critical.
Readability

**Improve DataFrame text display formatting in the notebook**

Ensure the DataFrame display in 'text/plain' output is properly formatted for better readability. Consider using pd.set_option to adjust display settings like max_columns, max_rows, or precision.

[docs/examples/retention.ipynb [153-179]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R153-R179)

```diff
-"   transaction_id transaction_datetime  customer_id  product_id \\\n",
-"0            7108  2023-01-12 17:44:29            1          15 \n",
-...
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+df.head()
```

Suggestion importance[1-10]: 6. Why: Adjusting display settings can improve readability, but the current formatting is already fairly readable. This is a minor enhancement.
**Use Python dictionary syntax for arrow properties to enhance readability**

Replace the manually written arrow-properties dictionary with the `dict(...)` constructor syntax, which enhances code readability and maintainability.

[docs/examples/retention.ipynb [311]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R311-R311)

```diff
-"arrowprops={\"facecolor\": \"black\", \"arrowstyle\": \"-|>\", \"connectionstyle\": \"arc3,rad=-0.25\", \"mutation_scale\": 25},\n",
+"arrowprops=dict(facecolor='black', arrowstyle='-|>', connectionstyle='arc3,rad=-0.25', mutation_scale=25),\n",
```

- [ ] Apply this suggestion

Suggestion importance[1-10]: 5. Why: The existing code is already quite readable, and this change offers only a slight improvement in readability and maintainability.