SophMC / notechain

A daily diary of learning data science skills.
13 stars 4 forks source link

create a timestamp column in a pandas dataframe #1

Closed SophMC closed 8 years ago

SophMC commented 8 years ago

@aaren I am reading in a txt file with the format year month day hour ws
and want to create an extra column which takes the first four columns and makes a timestamp out of them and makes a new column. This is proving difficult for several reasons:

date_spec = {'date_time': [0,1,2]} this works in wind below. date_spec = {'date_time': [0,1,2,3]} this doesn't. It creates a plain object for date_time, and we want a datetime object.

wind = pd.read_csv('/home/sophie/projects/windspeed/data/61401BirMoghrein_allwinds.txt', sep=" ", names=column_names, parse_dates=date_spec, keep_date_col=True)

d = datetime(2013, 12, 22, 11, 30, 59); print d works 2013-12-22 11:30:59

years = [2012, 2013, 2014] months=[5, 4, 3] days=[6, 5, 4] e = datetime(years, months, days); print e doesn't work

As far as I can see you can only input for value at a time - it doesn't take lists.

pd.to_datetime(df[['year', 'month', 'day']]) doesn't work, substituting df=wind

This other suggestion from the documentation (I copied exactly) df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3],'day': [4, 5],'hour': [2, 3]}) pd.to_datetime(df) doesn't work

Maybe I can update to 0.18.1 some other way? Any ideas most welcome. I'll continue the fight tomorrow.

SophMC commented 8 years ago

Solution!

specify the columns you want to group together. Can't include hour at this point as it is not in the right format.

date_spec = {'date_time': [0,1,2]}

Read and parse date_time simultaneously.

When you use keep_dat_col it keeps them as objects, not as the dtype you read them in as.

wind = pd.read_csv(datafile, sep=" ", names=column_names, parse_dates=date_spec, keep_date_col=True, index_col=False )

Dealing with hour - going from 600, 1200 etc to 6,12, 18

wind["hour"]=(wind["hour"]/100).astype(int)

combining date_time with hour, which is now in the correct format.

wind['date_time'] = pd.to_datetime(wind.date_time) + wind.hour.astype('timedelta64[h]')

print wind.date_time[0:5]

0 1984-03-01 06:00:00 1 1984-03-01 12:00:00 2 1984-03-01 18:00:00 3 1984-03-02 06:00:00 4 1984-03-02 12:00:00 Name: date_time, dtype: datetime64[ns]

The light bulb moment came from discovering the line wind.hour.astype('timedelta64[h]')