Open zepor opened 2 months ago
The task involves implementing data validation and cleansing routines using Python and pandas, and documenting these routines. The solution will include defining data quality rules, creating validation and cleansing functions, and providing examples and documentation.
- Create functions to check for common data quality issues such as missing values, duplicates, and data type mismatches.
- Create functions to handle missing values (e.g., imputation or removal) and remove duplicates.
- Add docstrings explaining the purpose and usage of each function.
- Add example validation and cleansing scripts with detailed explanations of the validation criteria.
`backend-container/src/utils/data_validation_and_cleansing.py`:
```python
import pandas as pd


def check_missing_values(df):
    """
    Check for missing values in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to check.

    Returns:
        pd.Series: The number of missing values per column.
    """
    return df.isnull().sum()


def check_duplicates(df):
    """
    Check for duplicate rows in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to check.

    Returns:
        int: The number of duplicate rows.
    """
    return df.duplicated().sum()


def check_data_types(df, expected_types):
    """
    Check whether each column's dtype matches the expected type.

    Parameters:
        df (pd.DataFrame): The DataFrame to check.
        expected_types (dict): Column names mapped to expected dtypes
            (e.g., 'float64').

    Returns:
        dict: Column names mapped to True if the dtype matches, else False.
    """
    return {col: df[col].dtype == expected_type
            for col, expected_type in expected_types.items()}


def handle_missing_values(df, strategy='drop', fill_value=None):
    """
    Handle missing values in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to clean.
        strategy (str): Either 'drop' (remove rows with missing values)
            or 'fill' (replace them with fill_value).
        fill_value: The replacement value when strategy is 'fill'.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    if strategy == 'drop':
        return df.dropna()
    elif strategy == 'fill':
        return df.fillna(fill_value)
    else:
        raise ValueError("Invalid strategy. Use 'drop' or 'fill'.")


def remove_duplicates(df):
    """
    Remove duplicate rows from the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to clean.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    return df.drop_duplicates()


# Example usage
if __name__ == "__main__":
    # Sample data
    data = {
        'A': [1, 2, 2, None, 4],
        'B': [5, None, 5, 6, 6],
    }
    df = pd.DataFrame(data)

    # Validate data
    print("Missing values per column:")
    print(check_missing_values(df))

    print("\nNumber of duplicate rows:")
    print(check_duplicates(df))

    expected_types = {'A': 'float64', 'B': 'float64'}
    print("\nData type check:")
    print(check_data_types(df, expected_types))

    # Clean data
    df_cleaned = handle_missing_values(df, strategy='fill', fill_value=0)
    df_cleaned = remove_duplicates(df_cleaned)
    print("\nCleaned DataFrame:")
    print(df_cleaned)
```
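The README examples below also validate value ranges; that check could be folded into this module as a reusable helper. A minimal sketch (the name `check_value_ranges` and its signature are illustrative, not part of the existing module):

```python
import pandas as pd


def check_value_ranges(df, ranges):
    """
    Check that numeric columns fall within expected bounds.

    Parameters:
        df (pd.DataFrame): The DataFrame to check.
        ranges (dict): Column names mapped to (min, max) tuples.

    Returns:
        dict: Column names mapped to the count of out-of-range values.
    """
    result = {}
    for col, (lo, hi) in ranges.items():
        # NaN comparisons are False, so missing values are not counted here
        mask = (df[col] < lo) | (df[col] > hi)
        result[col] = int(mask.sum())
    return result


df = pd.DataFrame({'age': [25, -3, 47, 130]})
print(check_value_ranges(df, {'age': (0, 120)}))  # {'age': 2}
```

Note that missing values pass this check silently, so it pairs naturally with `check_missing_values` rather than replacing it.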
`README.md`:
## Data Validation and Cleansing
### Data Validation
Data validation is crucial to ensure the quality and integrity of the data. Below are examples of validation scripts using Python and pandas.
#### Example 1: Checking for Null Values
```python
import pandas as pd

# Load your data into a DataFrame
df = pd.read_csv('data.csv')

# Check for null values
null_values = df.isnull().sum()
print("Null values in each column:\n", null_values)
```

#### Example 2: Checking Data Types and Ranges

```python
# Ensure the 'age' column is of integer type
# (note: astype(int) raises if nulls remain, so handle missing values first)
if not pd.api.types.is_integer_dtype(df['age']):
    df['age'] = df['age'].astype(int)

# Ensure 'age' values fall within a valid range
valid_age_range = (df['age'] >= 0) & (df['age'] <= 120)
if not valid_age_range.all():
    print("Some age values are out of the valid range.")
```
### Data Cleansing

Data cleansing involves correcting or removing inaccurate records from a dataset. Below are examples of cleansing scripts using Python and pandas.

#### Example 3: Removing Duplicates and Handling Missing Values

```python
# Remove duplicate rows
df = df.drop_duplicates()

# Fill missing values with a specific value
# (assignment is preferred over fillna(..., inplace=True) on a column,
# which can trigger chained-assignment warnings)
df['column_name'] = df['column_name'].fillna('default_value')

# Alternatively, drop rows with missing values
df = df.dropna(subset=['column_name'])

# Convert 'date' column to datetime format, coercing bad values to NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```
### Validation Criteria

- **Null Values:** Identify columns with missing entries and decide whether to fill or drop them.
- **Data Types:** Confirm each column's dtype matches the expected type (e.g., `age` as integer).
- **Data Ranges:** Verify numeric values fall within plausible bounds (e.g., `age` between 0 and 120).
- **Duplicates:** Detect and remove fully duplicated rows.
- **Missing Values:** Choose a handling strategy, such as imputation with a default value or removal.
By following these examples and criteria, you can ensure that your data is clean, consistent, and ready for analysis.
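Putting the criteria above together, an end-to-end cleansing pass might look like the following sketch (the `name` and `age` columns and the median-imputation choice are illustrative):

```python
import pandas as pd

# Illustrative raw data exhibiting duplicates, nulls, and a float dtype
df = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob', None],
    'age':  [34.0, 34.0, None, 29.0],
})

df = df.drop_duplicates()                          # remove exact duplicate rows
df['age'] = df['age'].fillna(df['age'].median())   # impute missing ages
df = df.dropna(subset=['name'])                    # require a name
df['age'] = df['age'].astype(int)                  # enforce integer dtype
assert df['age'].between(0, 120).all()             # range check
print(df)
```

Ordering matters here: imputation must precede the `astype(int)` conversion, which raises on nulls.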
**Files used for this task:**
<details>
<summary>Changes on backend-container/src/utils/data_validation_and_cleansing.py</summary>
To address the task of implementing data validation and cleansing routines in the `backend-container/src/utils/data_validation_and_cleansing.py` file, follow these steps:
1. **Define Data Quality Rules and Validation Criteria**:
- Create functions to check for common data quality issues such as missing values, duplicates, and data type mismatches.
2. **Implement Data Cleansing Operations**:
- Create functions to handle missing values (e.g., imputation or removal) and remove duplicates.
3. **Document the Code**:
- Add comments and docstrings to explain the purpose and usage of each function.
Here is a concrete implementation plan for the `data_validation_and_cleansing.py` file:
### Step-by-Step Implementation
1. **Import Necessary Libraries**:
- Import pandas and any other required libraries.
2. **Define Data Validation Functions**:
- Create functions to validate data quality rules.
3. **Define Data Cleansing Functions**:
- Create functions to clean the data based on the validation results.
4. **Example Code**:
- Provide example usage of the functions.
### Example Implementation
```python
import pandas as pd


def check_missing_values(df):
    """
    Check for missing values in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to check.

    Returns:
        pd.Series: The number of missing values per column.
    """
    return df.isnull().sum()


def check_duplicates(df):
    """
    Check for duplicate rows in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to check.

    Returns:
        int: The number of duplicate rows.
    """
    return df.duplicated().sum()


def handle_missing_values(df, strategy='drop', fill_value=None):
    """
    Handle missing values in the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to clean.
        strategy (str): Either 'drop' or 'fill'.
        fill_value: The replacement value when strategy is 'fill'.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    if strategy == 'drop':
        return df.dropna()
    elif strategy == 'fill':
        return df.fillna(fill_value)
    else:
        raise ValueError("Invalid strategy. Use 'drop' or 'fill'.")


def remove_duplicates(df):
    """
    Remove duplicate rows from the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame to clean.

    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    return df.drop_duplicates()


# Example usage
if __name__ == "__main__":
    # Sample data
    data = {
        'A': [1, 2, 2, None, 4],
        'B': [5, None, 5, 6, 6],
    }
    df = pd.DataFrame(data)

    # Validate data
    print("Missing values per column:")
    print(check_missing_values(df))

    print("\nNumber of duplicate rows:")
    print(check_duplicates(df))

    # Clean data
    df_cleaned = handle_missing_values(df, strategy='fill', fill_value=0)
    df_cleaned = remove_duplicates(df_cleaned)
    print("\nCleaned DataFrame:")
    print(df_cleaned)
```
`check_missing_values` and `check_duplicates` are used to validate the data; `handle_missing_values` and `remove_duplicates` are used to clean it. This implementation ensures that the data validation and cleansing routines are well defined and documented, making the code easier for other developers to understand and use.
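As a quick sanity check, the duplicate-handling routines could be exercised with a few assertions; the two functions are repeated here so the snippet runs standalone:

```python
import pandas as pd


def check_duplicates(df):
    """Return the number of fully duplicated rows."""
    return df.duplicated().sum()


def remove_duplicates(df):
    """Return a copy of the DataFrame with duplicate rows removed."""
    return df.drop_duplicates()


df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})
assert check_duplicates(df) == 1          # one repeated row
assert len(remove_duplicates(df)) == 2    # two unique rows remain
print("duplicate checks passed")
```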
Implement Data Validation and Cleansing Routines
Description: Create and execute data validation and cleansing scripts using Python. Tasks: