Haydenfiege opened this issue 3 years ago
Just a quick note in case you didn't see this in my earlier code: the two lines below read the data into pandas directly, without the need to go through a CSV save/read (unless there's a reason to save the CSV for other uses). Currently I put them into a loop to download weather data over a range of dates:

```python
import requests
import pandas as pd
from io import StringIO

csv_dl = requests.get(csv_url).content
df = pd.read_csv(StringIO(csv_dl.decode("utf8")))
```
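For context, a minimal sketch of what that date loop might look like — the URL template, station ID, and date range here are hypothetical placeholders, not the actual endpoint:

```python
import requests
import pandas as pd
from io import StringIO

# Hypothetical URL template; substitute the real download endpoint.
CSV_URL_TEMPLATE = "https://example.com/weather?station={station}&year={year}&month={month}"

frames = []
for year in range(2018, 2021):      # assumed date range, for illustration only
    for month in range(1, 13):
        csv_url = CSV_URL_TEMPLATE.format(station=51442, year=year, month=month)
        csv_dl = requests.get(csv_url).content
        frames.append(pd.read_csv(StringIO(csv_dl.decode("utf8"))))

weather_df = pd.concat(frames, ignore_index=True)
```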
I've also added some basic ETL cleaning procedures to my weather scrape script, under the `weather_proc` function: any column whose name contains a reference to temperature has its missing values filled by interpolation (assuming temperature can be reasonably interpolated when not available), any column with "date" in the name is converted to datetime, and any other column with missing values gets filled with zeros. The logic is sketched below.
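Since the actual `weather_proc` isn't pasted here, this is just a minimal sketch of the cleaning rules described above (the column-name matching is an assumption), not the real implementation:

```python
import pandas as pd

def weather_proc(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning: interpolate temperature columns, parse date columns,
    and zero-fill missing values everywhere else."""
    df = df.copy()
    for col in df.columns:
        name = col.lower()
        if "temp" in name:
            # Temperature is assumed to vary smoothly, so interpolate gaps.
            df[col] = pd.to_numeric(df[col], errors="coerce").interpolate()
        elif "date" in name:
            df[col] = pd.to_datetime(df[col], errors="coerce")
        else:
            df[col] = df[col].fillna(0)
    return df
```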
For a full-stack experience, we could take the standard approach of saving the raw CSVs to S3 storage, loading the curated data into a SQL database, and using a query/view from there to feed input to our actual modelling workflow.
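To make that concrete, here's a rough sketch using boto3 and SQLAlchemy — the bucket name, table name, and connection string are all placeholders we'd swap in once the infrastructure exists:

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Placeholder names; replace with our real bucket/database.
BUCKET = "weather-raw-csvs"
DB_URI = "postgresql://user:password@localhost:5432/weather"

def store_raw_csv(local_path: str, key: str) -> None:
    """Upload the raw CSV to S3 so the untouched source is always kept."""
    boto3.client("s3").upload_file(local_path, BUCKET, key)

def store_curated(df: pd.DataFrame, table: str = "weather_curated") -> None:
    """Write the cleaned DataFrame to SQL; modelling reads from a view on this table."""
    engine = create_engine(DB_URI)
    df.to_sql(table, engine, if_exists="append", index=False)
```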
If we end up doing some feature engineering or encoding on the data for modelling purposes after our EDA, we can decide whether those transformations happen during the ETL or as part of the modelling process later.
Right now we are pulling weather data from the Government of Canada, and some numerical columns have letters in them that refer to comments in the website's legend. These flags relate to things like missing data, trace amounts of precipitation, estimated temperatures, "precipitation occurred, amount uncertain", etc. We should be safe to remove these data points from our data set as part of the cleaning process.
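One way to handle that during cleaning — assuming the flags appear mixed into otherwise numeric columns, coercing to numeric turns flagged entries into NaN, which the `weather_proc` rules above can then fill or which we can drop outright:

```python
import pandas as pd

def strip_legend_flags(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Coerce flagged values (e.g. 'M' for missing, 'T' for trace) to NaN.
    numeric_cols is whichever columns we decide should be numeric."""
    df = df.copy()
    for col in numeric_cols:
        # Entries containing legend letters fail numeric parsing and become NaN.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```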