f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Read price csv into multi-index Pandas dataframe #15

Closed: halccw closed this issue 10 years ago

halccw commented 10 years ago

This part will be universal for other analysis and prediction, so I suggest we come up with a consensus.

Now I construct a multi-indexing dataframe from a csv file like this:

Dataframe

                                               date        price
product  sub            country  city      freq
Rice     Common/Coarse  India    Chittoor  week   2013-01-02  21.00
Rice     Common/Coarse  India    Guntur    week   2013-01-02  24.00
Rice     Fine           India    Asansol   week   2013-04-26  23.00
Rice     Fine           India    Salem     week   2013-04-26  24.00

The multi-index is built on 'product', 'sub', 'country', 'city' and 'freq'.
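
A minimal sketch of how such a dataframe could be built (the csv filename here is a placeholder; the column names are taken from the table above):

import pandas as pd

df = pd.read_csv('india_prices_weekly.csv', parse_dates=['date'])   # placeholder filename
df = df.set_index(['product', 'sub', 'country', 'city', 'freq']).sort_index()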

Query

And with numexpr installed, we can extract any sub-dataframe we want like this:

sub1 = df.query('product == "Rice"')
sub2 = df.query('product == "Rice" & city == "Asansol" & sub == "Fine"')

where sub1 and sub2 are two-column dataframes containing all (date, price) rows for each predicate, sorted by date. Finally, we can extract a certain time period like this:

after_july = sub1[sub1['date'] > '2013-07-01']

I'm quite new to Pandas. Any advice is welcome.

albu89 commented 10 years ago

I'm currently entertaining a friend, so I will send you my requirements for the association rule mining tonight. I think we already discussed this, but I will need to convert the continuous variables into range variables, i.e. if we have prices ranging from 0-100, convert them to 0-49 == low and 50-100 == high.
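
For illustration, the binning could look roughly like this (pd.cut is just one option, and the 0-49 / 50-100 boundaries are the example values above):

import pandas as pd

prices = pd.Series([12.0, 48.5, 50.0, 87.0])                         # toy prices
labels = pd.cut(prices, bins=[0, 49, 100], labels=['low', 'high'])   # 0-49 -> low, 50-100 -> high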

halccw commented 10 years ago

@albu89 Since the range varies across products and even over time, I suggest we use percentage returns, as in stock analysis, to categorize low and high.

Return at time i: r_i = (p_i - p_{i-1}) / p_{i-1}

If high means high volatility, we can define an upper and a lower threshold, e.g. '> 20% or < -20%' == 'high', otherwise == 'low'.

Having more buckets could improve the quality of the association mining.
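
A rough sketch of this return-based categorization, with made-up prices (the +/-20% threshold is the example value above):

import pandas as pd

prices = pd.Series([20.0, 21.0, 27.0, 20.0, 20.5])                 # toy weekly prices
returns = prices.pct_change()                                      # (p_i - p_{i-1}) / p_{i-1}
labels = returns.abs().gt(0.20).map({True: 'high', False: 'low'})
# note: the first return is NaN and ends up as 'low' here; real code should handle it explicitly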

halccw commented 10 years ago

@albu89 If we categorize by the exact price value, we might have trouble identifying correlated products. Prices in India have an increasing trend, so we may end up with all "highs" after some year and all "lows" before it.

So it would be more meaningful to categorize the price series according to weekly or daily variation, i.e., to compute the correlations on their volatility.
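
For example, something along these lines, where the wide dataframe of per-series prices is made up for illustration:

import pandas as pd

wide = pd.DataFrame({'rice_asansol': [23.0, 23.5, 24.0, 26.0],
                     'wheat_salem':  [14.0, 14.2, 14.5, 15.6]},
                    index=pd.date_range('2013-01-04', periods=4, freq='W-FRI'))
weekly_returns = wide.pct_change()          # weekly variation per series
corr_on_volatility = weekly_returns.corr()  # correlate returns, not price levels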

mstefanro commented 10 years ago

We have already agreed on how price data should look, so please stick to the same formatting. See https://github.com/fabbrix/humanitas/issues/9

halccw commented 10 years ago

@mstefanro Yes, that's exactly how Anton made those csv files. I'm talking about the Pandas dataframes in memory, from which we do the analysis.

halccw commented 10 years ago

Update: Some metadata of the India weekly price dataset:

number of distinct dates = 472, from 2005-01-07 to 2014-03-28
number of cities = 119, from Vishakhapatnam to Sonipet (unsorted)
number of products = 44, from Maize to Milk (unsorted)
number of (product, subproduct) pairs = 56, from ('Bread', 'Local') to ('Fish', 'Cat Fish') (unsorted)
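
For reference, a sketch of how these counts could be obtained from the multi-index dataframe described in my first comment (not the actual script):

# df is the multi-index dataframe built earlier (index: product, sub, country, city, freq)
n_dates = df['date'].nunique()
n_cities = df.index.get_level_values('city').nunique()
n_products = df.index.get_level_values('product').nunique()
n_product_sub = df.reset_index()[['product', 'sub']].drop_duplicates().shape[0]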

mstefanro commented 10 years ago

I have also computed some statistics and added them to analytics/statistics/india-weekly-... Feel free to add your discoveries to the same folder.

halccw commented 10 years ago

Update on the weekly India dataset:

The result of filtering NAs:

2005-2014:
  NA cutoff rate 0%: #series = 7
  NA cutoff rate 5%: #series = 223
  NA cutoff rate 10%: #series = 593
2009-2014:
  NA cutoff rate 0%: #series = 18
  NA cutoff rate 5%: #series = 319
  NA cutoff rate 10%: #series = 603

Observation: There are slightly more "good" series in 2009-2014.
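
A sketch of the NA-cutoff filter itself, assuming a hypothetical wide dataframe with one column per series aligned on a common weekly index (not the actual filtering code):

import numpy as np
import pandas as pd

wide = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],    # 25% missing
                     'b': [1.0, 2.0, 3.0, 4.0]})      # 0% missing
cutoff = 0.05                                         # keep series with <= 5% NA
na_rate = wide.isna().mean()                          # fraction of NaNs per series
good_series = na_rate[na_rate <= cutoff].index        # -> only 'b' survives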

Some temporal graphs (before interpolation and cleaning spikes)

chicken: https://cloud.githubusercontent.com/assets/4166714/2744259/f4432c0e-c722-11e3-8ce7-769482b3f9aa.png

rice: https://cloud.githubusercontent.com/assets/4166714/2744174/865194ca-c721-11e3-86da-46f05bc689d0.png

wheat: https://cloud.githubusercontent.com/assets/4166714/2744173/864dd68c-c721-11e3-9009-5ec65466a954.png

Observation:

  1. Rice has the best data quality among these 3.
  2. There are many flat "plateaus" in every series, meaning the price does not change every week.
  3. Rice appears to be more volatile after 2008, which could be verified with a first-difference analysis.
  4. The abnormal up/down "spikes" result from incorrect data points. I found that some series have duplicate data points for the same date, and the values can be very abnormal; see https://github.com/fabbrix/humanitas/blob/master/analysis/ts/dup.txt

The spikes will be removed by setting these values to NaN and doing linear interpolation series by series.
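
Roughly along these lines (a sketch, not the actual cleaning script; the dates and the spike value are made up):

import numpy as np
import pandas as pd

s = pd.Series([21.0, 22.0, 95.0, 22.5, 23.0],
              index=pd.to_datetime(['2013-01-04', '2013-01-11', '2013-01-11',
                                    '2013-01-18', '2013-01-25']))
dup_dates = s.index[s.index.duplicated(keep=False)]   # dates with more than one data point
s[s.index.isin(dup_dates)] = np.nan                   # discard the ambiguous values
s = s[~s.index.duplicated()]                          # keep one row per date
s = s.interpolate(method='time')                      # linear interpolation, series by series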

f4bD3v commented 10 years ago

Superb work!

Should we focus on 3 commodities to cut down on the workload? What do you think?

halccw commented 10 years ago

Yes, we should. Maybe 3-5. The guideline would be to select commodities that are staples and have good data quality in every city.

mstefanro commented 10 years ago

We need a good way to reduce the data from per-city to per-region by merging cities somehow. We might have to run PCA, but that could get too complicated.
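
One simple option, before anything as heavy as PCA, would be a plain per-region average; just a sketch with a made-up city-to-region mapping:

import pandas as pd

city_prices = pd.DataFrame({'Asansol': [23.0, 23.5],
                            'Guntur':  [21.0, 21.5],
                            'Salem':   [24.0, 24.5]},
                           index=pd.to_datetime(['2013-01-04', '2013-01-11']))
region_of = {'Asansol': 'East', 'Guntur': 'South', 'Salem': 'South'}  # hypothetical mapping
region_prices = city_prices.T.groupby(region_of).mean().T             # average the cities in each region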

halccw commented 10 years ago

I've completed duplicate removal and linear interpolation. Bad news: some abnormal spikes still exist, which may result from extremely high volatility or from incorrect source data (e.g., wrong "product, subproduct, city" labels in the datasets).

(updated plots: chicken, rice, wheat)

@rev112 Maybe you want to take a look at the abnormal duplicates in the following file. It seems that there may be "mixed" series in the dataset: https://github.com/fabbrix/humanitas/blob/master/analysis/ts/dup.txt

tonyo commented 10 years ago

@chingchia Ok, I'll take a look. We're talking about weekly prices, correct?

mstefanro commented 10 years ago

Let's focus on a few more than 3. I don't think having fewer makes it easier (except from a computational point of view). The model may be better if it has knowledge of more products at the same time (due to inter-product correlations).

mstefanro commented 10 years ago

@chingchia

Great work :)

Regarding interpolation, I've noticed you were doing some all_dates thingy by finding all the dates in the dataset. Instead, you may want to generate the date range yourself, in case some day is missing from all time series. In addition, please make sure you also treat the gaps when a time series starts late or stops early. For example, if the desired range is 1995-01-01 to 2014-02-28 and you have a time series starting from 1995-02-01, then you have to add 31 NAs prior to running interpolation to make sure the time series is aligned with everything else. I hope this doesn't complicate things too much for you. Maybe you can use pd.date_range.
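
Something like this, as a sketch (the W-FRI weekly frequency and the toy series are assumptions):

import pandas as pd

full_range = pd.date_range('2005-01-07', '2014-03-28', freq='W-FRI')   # full weekly date range
s = pd.Series([23.0, 23.5],
              index=pd.to_datetime(['2005-02-04', '2005-02-11']))      # a series that starts late
s = s.reindex(full_range)   # missing weeks, including the leading gap, become NaN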

Note that interpolate doesn't extrapolate (so it leaves the NaNs at the front as they are). You have to additionally call bfill():

In [77]: df=pd.Series([np.nan, np.nan, np.nan, 3, 5, np.nan, np.nan, 10, 
np.nan, np.nan, np.nan])

In [78]: df
Out[78]:
0    NaN
1    NaN
2    NaN
3      3
4      5
5    NaN
6    NaN
7     10
8    NaN
9    NaN
10   NaN
dtype: float64

In [79]: df.interpolate()
Out[79]:
0           NaN
1           NaN
2           NaN
3      3.000000
4      5.000000
5      6.666667
6      8.333333
7     10.000000
8     10.000000
9     10.000000
10    10.000000
dtype: float64

In [80]: _.bfill()
Out[80]:
0      3.000000
1      3.000000
2      3.000000
3      3.000000
4      5.000000
5      6.666667
6      8.333333
7     10.000000
8     10.000000
9     10.000000
10    10.000000
dtype: float64

halccw commented 10 years ago

@rev112 Yes, it's weekly. Thanks.

@mstefanro Good points, thanks! I have them fixed in the latest commit.

tonyo commented 10 years ago

@chingchia I checked the weekly prices for the dates you mentioned. The reason for the duplicated data is invalid parsing of the .xls documents by one of my scripts. Unfortunately, the .xls documents provided by rpms.dacnet.nic.in lack product names, so I used some heuristics to match them with prices, and it turned out that this didn't always work smoothly. I will think about detecting this kind of false data a bit later (by the beginning of next week); I cannot do serious work at the moment, sorry guys. The best workaround I can see right now is to treat the weeks with ambiguous data as missing (and maybe even a few weeks before and after them).
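
A sketch of that workaround, with made-up rows (widening the mask to the neighbouring weeks would work the same way):

import numpy as np
import pandas as pd

series = pd.DataFrame({'date': pd.to_datetime(['2013-01-04', '2013-01-11',
                                               '2013-01-11', '2013-01-18']),
                       'price': [21.0, 22.0, 95.0, 22.5]})
ambiguous = series['date'].duplicated(keep=False)    # weeks with more than one value
series.loc[ambiguous, 'price'] = np.nan              # treat them as missing
# (optionally also NaN-out the weeks just before and after each ambiguous date)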

halccw commented 10 years ago

@rev112 No problem. I can play with daily datasets first.

albu89 commented 10 years ago

@chingchia I can provide you with a weekly dataset that is prefiltered, meaning that all duplicates have been removed, empty fields have been replaced with "NA", and all special symbols such as {"/", "(", ")"} have been removed. I'll upload it in a bit; you can assess yourself whether it's of any help.
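
Roughly this kind of prefiltering, as a sketch ('raw.csv' and the column names are placeholders, not the actual file):

import pandas as pd

raw = pd.read_csv('raw.csv')                                 # placeholder filename
raw = raw.drop_duplicates()                                  # remove duplicate rows
raw = raw.fillna('NA')                                       # empty fields -> "NA"
for col in ['product', 'sub', 'city']:                       # assumed text columns
    raw[col] = raw[col].astype(str).str.replace(r'[/()]', '', regex=True)   # strip "/", "(", ")"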