PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Add Preprocess Function for Data Cleaning and Validation #3321

Open sambhavnoobcoder opened 6 days ago

sambhavnoobcoder commented 6 days ago

Description:

This PR introduces a new preprocess function that streamlines data cleaning and validation. The function reads the input data and site coordinates, validates that the specified date and carbon pool are present, and ensures that the data dimensions are consistent. It returns a structured list containing the cleaned data, ready for further analysis. An extended description of the function and its components follows.

Motivation and Context

Function: preprocess

Purpose:

The preprocess function reads and validates the input data and site coordinates so that the data is correctly formatted and consistent for further processing. It handles potential inconsistencies in the data, emitting informative messages and making adjustments where necessary.

Parameters:

data_path: Path to the RDS file containing the input data.
coords_path: Path to the CSV file containing site coordinates.
date: The specific date for which the carbon data is to be extracted.
C_pool: The specific carbon pool within the input data to focus on.

Process:

Reading Data:

Reads the input data from the provided RDS file. Reads the site coordinates from the provided CSV file.
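The reading step can be sketched in base R as follows. This is a minimal illustration, not the code in this PR: the helper name read_inputs, the toy data layout (a list keyed by date, each entry holding one matrix per carbon pool), and the use of read.csv are all assumptions made for the example.

```r
# Hypothetical helper (not from the PR): read both inputs and return them together.
read_inputs <- function(data_path, coords_path) {
  list(
    input_data = readRDS(data_path),
    site_coordinates = read.csv(coords_path, stringsAsFactors = FALSE)
  )
}

# Toy inputs written to temporary files so the sketch runs on its own.
# Assumed layout: a list keyed by date, each entry a list of pool matrices.
rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")
toy_data <- list("2022-01-01" = list(TotalCarbon = matrix(1:6, nrow = 2)))
saveRDS(toy_data, rds_file)
write.csv(
  data.frame(lon = c(-90.1, -89.5, -88.9), lat = c(45.2, 45.8, 46.1)),
  csv_file,
  row.names = FALSE
)

inputs <- read_inputs(rds_file, csv_file)
str(inputs$site_coordinates)
```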

Validation:

Checks whether the specified date exists in the input data. If not, the function stops with an error message. Extracts the carbon data for the specified date and validates that the specified carbon pool exists. If the carbon pool is not found, the function stops with an error message.
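A rough sketch of these two checks is shown below. It assumes the input data is a named list keyed by date, with each entry a named list of carbon pools. That layout, the helper name extract_pool, and the wording of the error messages are illustrative assumptions, not taken from the PR.

```r
# Hypothetical helper: return the data for one pool on one date, or stop.
extract_pool <- function(input_data, date, C_pool) {
  if (!date %in% names(input_data)) {
    stop("Date '", date, "' not found in the input data.")
  }
  data_for_date <- input_data[[date]]
  if (!C_pool %in% names(data_for_date)) {
    stop("Carbon pool '", C_pool, "' not found for date '", date, "'.")
  }
  data_for_date[[C_pool]]
}

# Toy input: 2 ensembles (rows) by 3 sites (columns) for a single date.
toy_data <- list("2022-01-01" = list(TotalCarbon = matrix(1:6, nrow = 2)))
pool <- extract_pool(toy_data, "2022-01-01", "TotalCarbon")

# A missing date raises an error that the caller can catch.
msg <- tryCatch(
  extract_pool(toy_data, "1999-01-01", "TotalCarbon"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Because stop() raises a condition rather than returning a value, callers that want to continue after a failed check need tryCatch, as in the last lines.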

Data Transformation:

Transposes the extracted carbon data to a data frame format, ensuring each column represents an ensemble. Renames the columns to a consistent naming convention (e.g., "ensemble1", "ensemble2", etc.).

Coordinate Validation:

Ensures that the site coordinates data contains 'lon' and 'lat' columns. If these columns are missing, the function stops with an error message.
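The transformation and coordinate check described above might look like the sketch below. It assumes the extracted pool is a matrix with one row per ensemble and one column per site. The helper names to_ensemble_frame and check_coords are hypothetical and not taken from the PR.

```r
# Hypothetical helper: ensembles-by-sites matrix -> sites-by-ensembles data frame.
to_ensemble_frame <- function(pool) {
  carbon_data <- as.data.frame(t(pool))
  names(carbon_data) <- paste0("ensemble", seq_len(ncol(carbon_data)))
  carbon_data
}

# Hypothetical helper: stop unless both coordinate columns are present.
check_coords <- function(site_coordinates) {
  if (!all(c("lon", "lat") %in% names(site_coordinates))) {
    stop("Site coordinates must contain 'lon' and 'lat' columns.")
  }
  invisible(site_coordinates)
}

# Toy pool: 2 ensembles (rows) by 3 sites (columns).
pool <- matrix(c(10, 11, 20, 21, 30, 31), nrow = 2)
carbon_data <- to_ensemble_frame(pool)
print(names(carbon_data))

coords <- data.frame(lon = c(-90.1, -89.5, -88.9), lat = c(45.2, 45.8, 46.1))
check_coords(coords)
```

After the transpose, each row corresponds to a site, which is what lets the row count be compared with the coordinates table in the next step.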

Data Consistency:

Validates that the number of rows in the site coordinates matches the number of rows in the carbon data. If the counts differ, the function truncates whichever of the two is longer so that the row counts match.
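One way to express this reconciliation is sketched below. The helper name align_rows and the message text are assumptions for illustration. The sketch keeps the first n rows of each table, which only makes sense if both tables list sites in the same order.

```r
# Hypothetical helper: trim both tables to their common row count.
align_rows <- function(site_coordinates, carbon_data) {
  n_coords <- nrow(site_coordinates)
  n_carbon <- nrow(carbon_data)
  if (n_coords != n_carbon) {
    n <- min(n_coords, n_carbon)
    message(
      "Row mismatch: ", n_coords, " coordinate rows vs ", n_carbon,
      " carbon rows; truncating to ", n, "."
    )
    # Indexing up to n leaves the shorter table unchanged.
    site_coordinates <- site_coordinates[seq_len(n), , drop = FALSE]
    carbon_data <- carbon_data[seq_len(n), , drop = FALSE]
  }
  list(site_coordinates = site_coordinates, carbon_data = carbon_data)
}

coords <- data.frame(lon = c(-90.1, -89.5, -88.9, -88.0),
                     lat = c(45.2, 45.8, 46.1, 46.5))
carbon <- data.frame(ensemble1 = c(10, 20, 30), ensemble2 = c(11, 21, 31))
aligned <- align_rows(coords, carbon)
print(nrow(aligned$site_coordinates))
```

Truncation drops sites from the longer table, so the emitted message is the only sign that data was discarded. Raising an error instead would be the stricter alternative.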

Output:

The function returns a list containing:

input_data: The original input data read from the RDS file.
site_coordinates: The validated and possibly truncated site coordinates.
carbon_data: The validated and possibly truncated carbon data.

Messages:

The function provides informative messages during the preprocessing steps, alerting the user to any adjustments made to the data to ensure consistency.

Example Usage:

preprocessed_data <- preprocess("path/to/input_data.rds", "path/to/site_coords.csv", "2022-01-01", "TotalCarbon")

Benefits:

Efficiency: Streamlines the data preparation process, reducing manual validation and transformation steps.
Error Handling: Provides clear error messages and handles common data issues, improving robustness.
Consistency: Ensures consistent data formats and dimensions, facilitating further analysis and modeling.
