D-Mielewczyk / euro-temperature-trend-stats

MIT License

Clean Data Using PySpark for Further Analysis #2

Closed: D-Mielewczyk closed this issue 3 months ago

D-Mielewczyk commented 3 months ago

Objective: Develop a PySpark script to clean the climate data fetched from the European Climate Assessment & Dataset (ECA&D). The cleaned data should be prepared and formatted appropriately for further analysis. This includes handling missing values, correcting data types, and any other necessary preprocessing steps.

Requirements:

  1. The script should:
    • Load the raw data into a Spark DataFrame.
    • Handle missing values by filling, dropping, or imputing them as appropriate.
    • Ensure data types are correct and consistent, casting every column to the most suitable type available.
    • Normalize or transform data if needed (e.g., scaling temperature values, parsing dates).
    • Save the cleaned data to a specified location for further analysis.
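A minimal sketch of how the steps above might fit together. It assumes details this issue does not state: that the raw file is a CSV with the header row `STAID, SOUID, DATE, TG, Q_TG` (any text preamble already removed), that `-9999` marks a missing value, that temperatures are stored in tenths of a degree Celsius, and that dates use the `yyyyMMdd` layout. The function names, output column names, and the Parquet output format are illustrative choices only.

```python
# Assumed ECA&D conventions; check them against the header of the downloaded file.
MISSING_SENTINEL = -9999  # assumed marker for a missing observation
TENTHS_PER_DEGREE = 10    # assumed: values are stored in units of 0.1 degrees C


def cleaning_exprs(value_col="TG", quality_col="Q_TG"):
    """Return Spark SQL expressions that cast, parse, and rescale the raw columns."""
    return [
        "CAST(STAID AS INT) AS station_id",
        "CAST(SOUID AS INT) AS source_id",
        # Dates are assumed to be stored as yyyyMMdd, e.g. 19010131.
        "to_date(CAST(`DATE` AS STRING), 'yyyyMMdd') AS date",
        # Turn the sentinel into NULL, otherwise convert to degrees Celsius.
        f"CASE WHEN CAST({value_col} AS INT) = {MISSING_SENTINEL} THEN NULL "
        f"ELSE CAST({value_col} AS DOUBLE) / {TENTHS_PER_DEGREE} END AS temperature_c",
        f"CAST({quality_col} AS TINYINT) AS quality",
    ]


def clean(spark, raw_path, out_path):
    """Load the raw CSV, clean it, and write the result as Parquet.

    `spark` is an existing SparkSession supplied by the caller.
    """
    raw = (
        spark.read
        .option("header", True)
        .option("ignoreLeadingWhiteSpace", True)
        .option("ignoreTrailingWhiteSpace", True)
        .csv(raw_path)
    )
    cleaned = (
        raw.selectExpr(*cleaning_exprs())
        # Dropping is only one option; imputing may suit some analyses better.
        .dropna(subset=["date", "temperature_c"])
    )
    cleaned.write.mode("overwrite").parquet(out_path)
    return cleaned
```

With an existing session, for example `spark = SparkSession.builder.appName("clean-ecad").getOrCreate()`, the job could be run as `clean(spark, "raw/TG.csv", "clean/tg")`; both paths and the app name are placeholders.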

Additional Notes:

Cleaning the data also means casting every column to the most suitable type available.
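One way to make the casting requirement checkable is to declare the target types in one place and compare them with what Spark reports afterwards. The sketch below is illustrative only: the column names and target types are assumptions about the data, and `cast_columns` and `schema_mismatches` are hypothetical helper names.

```python
# Hypothetical target schema; adjust the names and types to the real columns.
TARGET_TYPES = {
    "station_id": "int",
    "date": "date",
    "temperature_c": "double",
    "quality": "tinyint",
}


def cast_columns(df, target_types=TARGET_TYPES):
    """Cast each listed column of a Spark DataFrame to its target type.

    A plain cast does not parse custom date layouts such as yyyyMMdd;
    parse those explicitly before relying on this helper.
    """
    for name, type_name in target_types.items():
        df = df.withColumn(name, df[name].cast(type_name))
    return df


def schema_mismatches(dtypes, target_types=TARGET_TYPES):
    """Compare (name, type) pairs, as returned by DataFrame.dtypes, with the targets.

    Returns a dict of column -> (expected, actual); actual is None if the
    column is absent.
    """
    actual = dict(dtypes)
    return {
        name: (expected, actual.get(name))
        for name, expected in target_types.items()
        if actual.get(name) != expected
    }
```

After casting, a check such as `assert not schema_mismatches(cast_columns(df).dtypes)` would fail loudly if any column ended up with an unexpected type.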

D-Mielewczyk commented 3 months ago

Not blocked at all; you can start working on this issue.