**Closed** — mvanwyk closed this pull request 5 months ago.
The recent changes shift the data workflow from generating simulated data to loading pre-simulated data. This affects multiple notebooks and documentation files, with instructions and examples updated accordingly. In addition, the `.gitignore` file now excludes `.csv` instead of `.parquet` files, `pyproject.toml` was modified to remove certain dependencies and script entries, and the navigation in `mkdocs.yml` was restructured for better clarity and organization.
| Files | Change Summary |
|---|---|
| `.gitignore` | Updated to exclude `.csv` instead of `*.parquet` files. |
| `README.md` | Removed the section on generating simulated transaction data; replaced with "Coming Soon." |
| `docs/examples/cross_shop.ipynb` | Changed from simulating data to loading pre-simulated data; updated displayed data. |
| `docs/examples/data_contracts.ipynb` | Updated text and functionality for loading data; added new class and type hints in function parameters. |
| `docs/examples/gain_loss.ipynb` | Switched from simulating to loading pre-simulated data; updated brand names and prices. |
| `docs/examples/retention.ipynb` | Significant changes to load data from a file; included new imports and updated output visualizations. |
| `…/examples/…` (multiple files) | Grouped similar changes across multiple notebook files for brevity. |
| `mkdocs.yml` | Rearranged the navigation structure; removed outdated sections and links. |
| `pyproject.toml` | Removed the `click` dependency; reordered some package versions; removed a script entry. |
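Based on the `.gitignore` row above, the swapped ignore rule presumably looks something like this (the exact patterns are not shown in the summary):

```
# previously: generated parquet files were ignored
# *.parquet

# now: CSV outputs are ignored, and the pre-simulated parquet data is tracked
*.csv
```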
In the realm where data flows,
Files transformed and notebooks glowed,
From simulating days to pre-simulated ways,
Cleaner paths now boldly showed.
CSVs we shall hide,
In structured lines, our progress pried.
A celebratory leap, with code we keep!
**Estimated effort to review:** 3/5

**Relevant tests:** No relevant tests

**Security concerns:** No security concerns identified

**Key issues to review:**

- **Data consistency:** Ensure that the new data source (parquet files) maintains consistency with the previously simulated data, especially in terms of data structure and content.
- **Exception handling:** Review the changes to exception handling in `data_contracts.ipynb` to ensure they are appropriate and provide clear error messages.
- **Documentation updates:** Verify that all documentation and comments accurately reflect the changes made, especially in the Jupyter notebooks and the README.
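For the data-consistency point, a one-off check along these lines could compare the parquet file against the simulator's expected output. This is a sketch: the column names and dtypes in `EXPECTED_DTYPES` are assumptions for illustration, not the project's confirmed schema.

```python
import pandas as pd

# Hypothetical expected schema; fill in from the previous simulator's output.
EXPECTED_DTYPES = {
    "transaction_id": "int64",
    "customer_id": "int64",
    "total_price": "float64",
}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Return human-readable schema mismatches (empty list = consistent)."""
    problems = []
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

# Tiny frame with explicit dtypes so the check is deterministic:
df = pd.DataFrame(
    {
        "transaction_id": pd.Series([1], dtype="int64"),
        "customer_id": pd.Series([2], dtype="int64"),
        "total_price": pd.Series([3.5], dtype="float64"),
    }
)
print(check_schema(df))  # -> []
```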
**Code suggestions** (category · suggestion · importance score)
**Possible bug**

**Correct the case sensitivity in the DataFrame type hint** — Replace `pd.Dataframe` with `pd.DataFrame` to fix the case-sensitivity issue in the type hint, which could otherwise lead to runtime errors or problems with static type checkers.

[docs/examples/data_contracts.ipynb [812]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R812-R812)

```diff
-def top_customers(df: pd.Dataframe, n: int=5) -> pd.DataFrame:
+def top_customers(df: pd.DataFrame, n: int=5) -> pd.DataFrame:
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 10 — Why: The correction from `pd.Dataframe` to `pd.DataFrame` is crucial, as it prevents potential runtime errors and issues with static type checkers, ensuring the code functions correctly.
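The high score is justified: pandas exposes `DataFrame` but no attribute named `Dataframe`, so anything that resolves the misspelled annotation at runtime fails with an `AttributeError`. A quick demonstration:

```python
import pandas as pd

# The correctly-cased name exists; the misspelled one does not.
assert hasattr(pd, "DataFrame")
assert not hasattr(pd, "Dataframe")

# Any eager lookup of the bad annotation therefore raises:
try:
    pd.Dataframe
except AttributeError as e:
    print("AttributeError:", e)
```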
**Best practice**

**Add data validation after loading the dataframe to ensure it contains all expected columns** — It's recommended to validate data loaded from external sources to ensure it meets expected formats and constraints. This can prevent issues arising from malformed or unexpected data.

[docs/examples/segmentation.ipynb [197-198]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R198)

```diff
 df = pd.read_parquet("../../data/transactions.parquet")
+# Ensure the dataframe contains expected columns
+expected_columns = {'transaction_id', 'transaction_datetime', 'customer_id', 'product_id', 'product_name', 'category_0_name', 'category_0_id', 'category_1_name', 'category_1_id', 'brand_name', 'brand_id', 'unit_price', 'quantity', 'total_price', 'store_id'}
+assert expected_columns.issubset(df.columns), "Dataframe is missing one or more expected columns"
 df.head()
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 9 — Why: This suggestion adds a crucial validation step to ensure the data meets expected formats, which can prevent downstream errors due to malformed data.
**Set the random seed outside the function call for consistent outputs** — Ensure that the random seed is set outside the function call for reproducibility. This practice helps maintain consistent outputs for the random choices made in the notebook.

[docs/examples/cross_shop.ipynb [246-248]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR246-R248)

```diff
-df.loc[shoes_idx, "category_1_name"] = np.random.RandomState(42).choice(
+rng = np.random.RandomState(42)
+df.loc[shoes_idx, "category_1_name"] = rng.choice(
     ["Shoes", "Jeans"],
     size=shoes_idx.sum(),
     p=[0.5, 0.5],
 )
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 8 — Why: This suggestion ensures reproducibility, which is important for consistent results, especially in a notebook setting.
**Use a custom exception for clearer error handling** — Instead of raising a generic `ValueError`, raise a more specific custom exception to provide clearer error handling specific to the domain or application.

[docs/examples/data_contracts.ipynb [817]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R817-R817)

```diff
-raise ValueError(msg)
+class ContractValidationError(Exception):
+    pass
+
+raise ContractValidationError(msg)
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 7 — Why: Using a custom exception improves error handling by providing clearer and more specific error messages, which is a best practice for maintainable code.
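Expanded into a self-contained sketch (the exception name comes from the suggestion's diff; the validation helper and message are illustrative, not the notebook's actual logic):

```python
class ContractValidationError(Exception):
    """Raised when a dataframe fails its data-contract checks."""

def require_columns(columns: list[str], required: list[str]) -> None:
    # Illustrative check; the notebook's real validation differs.
    missing = [c for c in required if c not in columns]
    if missing:
        msg = f"missing required columns: {missing}"
        raise ContractValidationError(msg)

try:
    require_columns(["customer_id"], ["customer_id", "total_price"])
except ContractValidationError as e:
    print(e)  # -> missing required columns: ['total_price']
```

Callers can then catch `ContractValidationError` specifically without accidentally swallowing unrelated `ValueError`s raised elsewhere.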
**Robustness**

**Add error handling around the file reading operation to manage potential exceptions** — Consider adding error handling for file reading operations to manage exceptions that may occur if the file is missing or corrupt.

[docs/examples/segmentation.ipynb [197]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-beed0b49cd64bc123ff2bf27db4f00fd930efb9557cec04cf0da9107814a0fe0R197-R197)

```diff
-df = pd.read_parquet("../../data/transactions.parquet")
+try:
+    df = pd.read_parquet("../../data/transactions.parquet")
+except Exception as e:
+    print(f"Failed to read data: {e}")
+    # Handle the error appropriately, possibly re-raise or log
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 9 — Why: Adding error handling improves the robustness of the code by managing exceptions that may occur during file reading operations, preventing the program from crashing unexpectedly.
**Enhancement**

**Add a data type expectation for the `total_price` column** — Ensure that the `ExpectationConfiguration` for the `total_price` column includes a check for the column's data type, enhancing data validation and consistency.

[docs/examples/data_contracts.ipynb [895-897]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R895-R897)

```diff
 ExpectationConfiguration(
     expectation_type="expect_column_to_exist",
     kwargs={"column": "total_price"},
 ),
+ExpectationConfiguration(
+    expectation_type="expect_column_values_to_be_of_type",
+    kwargs={"column": "total_price", "type_": "float"},
+),
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 9 — Why: Including a data type expectation for the `total_price` column enhances data validation and consistency, ensuring that the data meets expected standards.
**Add a check for an empty DataFrame to prevent errors** — Add a check to ensure that the DataFrame `df` is not empty before sorting and returning the top customers. This prevents potential errors when operating on an empty DataFrame.

[docs/examples/data_contracts.ipynb [819]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-04035e37193454785e4756977df6abbbb198c33b853da3798a435ea143fde454R819-R819)

```diff
+if df.empty:
+    return df
 return df.sort_values("total_price", ascending=False).head(n).reset_index(drop=True)
```

- [ ] **Apply this suggestion**

Suggestion importance [1-10]: 8 — Why: Adding a check for an empty DataFrame enhances the robustness of the function by preventing potential errors when operating on an empty DataFrame.
**Use pandas to generate the HTML table** — Replace the hard-coded HTML table with dynamic generation from the pandas DataFrame, keeping the rendered table in sync with the underlying data:

```diff
+df.to_html(classes='dataframe', border=1)
```

Suggestion importance [1-10]: 8 — Why: This suggestion enhances code readability and maintainability by leveraging pandas' built-in functionality, reducing the need for hard-coded HTML.
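A runnable sketch of the `to_html` replacement (the `classes` and `border` arguments mirror the suggestion; the sample frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2], "total_price": [19.99, 5.50]})

# Generate the markup from the dataframe instead of hand-writing HTML,
# so the table always reflects the current columns and values.
html = df.to_html(classes="dataframe", border=1)
print(html[:60])  # opening <table ...> tag and start of the header
```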
**Possible issue**

**Add a check to ensure the DataFrame is not empty to prevent runtime errors** — To ensure that the DataFrame is not empty before performing operations, add a check confirming that `df` is not empty after loading the data. This prevents potential errors in subsequent operations if the data file is missing or empty.

[docs/examples/cross_shop.ipynb [195-196]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-74242af4acfafa7f389d644d9c6fba1ca0c589910a07f4b59ef42aa5191c31ccR195-R196)

```diff
 df = pd.read_parquet("../../data/transactions.parquet")
+if df.empty:
+    raise ValueError("Data file is empty or not found.")
 df.head()
```
**Rename `shoes_idx` to enhance code clarity** — A more descriptive name such as `shoes_category_filter` would provide more context about what the mask selects.

**Use `pd.set_option` to adjust display settings** — Rather than hard-coding truncated output, use `pd.set_option` to adjust display settings such as `max_columns`, `max_rows`, or `precision`.
[docs/examples/retention.ipynb [153-179]](https://github.com/Data-Simply/pyretailscience/pull/55/files#diff-246c66d91f5e54e0cfe7bf8c39ffbc79919b309dc93e80d21c6332ab4b88c115R153-R179)
```diff
-" transaction_id transaction_datetime customer_id product_id \\\n",
-"0 7108 2023-01-12 17:44:29 1 15 \n",
-...
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+df.head()
```
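A runnable sketch of the display-settings suggestion (the option names are standard pandas options; the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.23456], "b": [2.34567]})

pd.set_option("display.max_columns", None)  # never elide columns
pd.set_option("display.precision", 2)       # round floats in the display only

print(df)  # values render as 1.23 and 2.35; the underlying data is unchanged
```

In a shared notebook it may be worth restoring defaults afterwards, e.g. with `pd.reset_option("display.precision")`.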
**PR Type**

Enhancement, Documentation

**Description**

- Refactored the example notebooks (`data_contracts.ipynb` and others) to load pre-simulated data from a parquet file instead of generating it in-notebook.

**Changes walkthrough**

| File | Summary |
|---|---|
| `docs/examples/data_contracts.ipynb` | Refactor data contracts example to load data from parquet file; updates the `top_customers` function. |
| `docs/examples/retention.ipynb` | Refactor retention example to load data from parquet file. |
| `docs/examples/gain_loss.ipynb` | Refactor gain/loss example to load data from parquet file. |
| `docs/examples/cross_shop.ipynb` | Refactor cross-shop example to load data from parquet file. |
| `README.md` | Update README to remove data simulation instructions. |
| `mkdocs.yml` | Update mkdocs configuration to reflect new examples structure. |
| `docs/examples/segmentation.ipynb` | … |
**Summary by CodeRabbit**

- **New Features**
- **Documentation**
  - Restructured navigation in `mkdocs.yml` for better clarity and structure.
- **Chores**
  - Updated `.gitignore` to exclude `.csv` files instead of `.parquet` files.
  - Removed the `click` dependency and a script entry in `pyproject.toml`.