abjer / sds2019

Social Data Science 2019 - a summer school course
https://abjer.github.io/sds2019

Ex. 6.1.5 #20

Open IAmAndreasSK opened 5 years ago

IAmAndreasSK commented 5 years ago

Hi,

I tried to create the code for 6.1.5. I tested it line by line, and it doesn't seem to work with the loop. I am trying to get the country codes and insert them into the appropriate column, "Country_Codes". What should I change?

Note: This is not the full code I intend to write, but it's what I have so far.

def weather(year):
    import pandas as pd
    import re
    url="https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"+year+".csv.gz"
    data=pd.read_csv(url,header=-1)
    data=data.drop(data.columns[4:],axis=1)
    COLS=["Station_Identifier","Observation_Time","Observation_Type","Observation_Value"]
    data.columns=COLS
    data["Observation_Value"]=data["Observation_Value"]/10
    data.round(decimals=2)
    data2=data.loc[(data["Observation_Type"]=="TMAX")]
    data2["TMAX_F"]=data2["Observation_Value"]*1.8+32
    data2["Observation_Time"]=data2["Observation_Time"].astype(str)
    data2.Observation_Time=pd.to_datetime(data2["Observation_Time"]) #.loc[row_indexer,col_indexer] = value instead
    data2["Month"]=data2["Observation_Time"].dt.month
    data2["Country_Code"]=""
    for i,row in data2.iterrows():
            data2.loc[i,"Country_Code"]=" ".join(re.findall("[a-zA-Z]+", data2.loc[i,"Station_Identifier"]))
    data2.set_index("Observation_Time")
    print(data2)

weather("1905")

Thanks, Andreas

sebastianbaltser commented 5 years ago

Line number 11 of your function should be:

    data2 = data.loc[(data["Observation_Type"]=="TMAX")].copy()

With just data2=data.loc[(data["Observation_Type"]=="TMAX")], pandas cannot tell whether data2 is a view of data or a copy. Later assignments to data2 then trigger a SettingWithCopyWarning and may not behave as you expect. What you want is an explicit copy of the selection assigned to data2.
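A minimal sketch of the pattern on a toy frame, in case it helps. The station IDs and values below are made up for illustration and are not taken from the NOAA file:

```python
import pandas as pd

# Hypothetical toy data standing in for the downloaded NOAA file.
data = pd.DataFrame({
    "Station_Identifier": ["ITE00100550", "ITE00100550", "USC00012345"],
    "Observation_Type": ["TMAX", "TMIN", "TMAX"],
    "Observation_Value": [10.0, 4.0, 25.0],
})

# .copy() makes data2 an independent frame, so adding or changing
# columns on it is unambiguous and leaves data untouched.
data2 = data.loc[data["Observation_Type"] == "TMAX"].copy()
data2["TMAX_F"] = data2["Observation_Value"] * 1.8 + 32

tmax_f = data2["TMAX_F"].tolist()
```

After this, data still has only its three original columns, while data2 holds the two TMAX rows plus the new TMAX_F column.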

kristianolesenlarsen commented 5 years ago

I think Sebastian found your mistake. However, here are some general comments:

These two lines don't belong in a function body (imports go at the top of the file):

    import pandas as pd
    import re

You could replace

url="https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"+year+".csv.gz"

with

url="https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{}.csv.gz".format(year)
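On Python 3.6 or newer, an f-string is another way to write the same thing. This is just a sketch; the helper name build_url is hypothetical and not part of the exercise:

```python
# Base URL copied from the code above.
BASE = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/"

def build_url(year):
    # The f-string formats year for us, so both "1905" and 1905 work,
    # unlike string concatenation with +, which fails on an int.
    return f"{BASE}{year}.csv.gz"

url = build_url(1905)
```

Like .format, this accepts the year as either a string or an integer.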

This

data2["Country_Code"]=""
for i,row in data2.iterrows():
        data2.loc[i,"Country_Code"]=" ".join(re.findall("[a-zA-Z]+", data2.loc[i,"Station_Identifier"]))

is overly complicated. It could be done by

data2['Country_Code'] = data2['Station_Identifier'].str.extract("([a-zA-Z]+)")

or

data2['Country_Code'] = data2['Station_Identifier'].apply(lambda x: re.findall('[a-zA-Z]+', x)[0] )
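To compare the two suggestions, here is a small sketch on a toy frame. The station IDs are illustrative examples, and passing expand=False is my own addition so that str.extract returns a Series instead of a one-column DataFrame:

```python
import re
import pandas as pd

# Hypothetical station IDs, only for illustration.
stations = pd.DataFrame(
    {"Station_Identifier": ["ITE00100550", "USC00012345", "ASN00001234"]}
)

# Vectorised version: takes the first run of letters in each ID.
stations["Code_extract"] = stations["Station_Identifier"].str.extract(
    "([a-zA-Z]+)", expand=False
)

# Row-by-row version: same regex, applied with a lambda.
stations["Code_apply"] = stations["Station_Identifier"].apply(
    lambda x: re.findall("[a-zA-Z]+", x)[0]
)

codes_extract = stations["Code_extract"].tolist()
codes_apply = stations["Code_apply"].tolist()
```

Both columns come out identical here. One difference worth knowing: if an ID contained no letters at all, str.extract would give NaN for that row, while the findall(...)[0] version would raise an IndexError.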