f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

India product selection #18

halccw closed this issue 10 years ago

halccw commented 10 years ago

For the India weekly dataset, we may select our target products based on previous statistics and this table:

https://github.com/fabbrix/humanitas/blob/master/analysis/ts/na_table_org.csv

  1. I sorted products according to the column "city counts of cut off rate 0.2."
  2. Average rates show little difference among usable products.
  3. If we set the cutoff rate to 30% (which is admittedly a bit high), we will have 35-40 cities for the top 10 products.
  4. Besides, it seems that we do not have to worry about the subproduct dimension except for rice.
  5. One interesting observation: most cities report prices at a nearly constant rate.
mstefanro commented 10 years ago

@chingchia

Does a cut-off rate of 0.2 mean picking only those series for which at least 80% of the values are known prior to interpolation?

Important: this may be confusing, but the "region" column in the daily and weekly datasets really means city, NOT region. To get the region, you need to join the datasets with the /data/india/csv_daily/agmarknet.nic.in/regions.csv file. I suggest we rename the column now to avoid future confusion.

Since our prediction model is per-region rather than per-city, maybe you should base your stats on regions instead. Saying "35-40 cities" is not very informative, because they may all be from the same region. And since we are going to merge them using PCA (or averages?) in the end, we would effectively have a single city if they all turn out to be in the same region.
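A minimal sketch of the rename-and-join step described above. The layout of regions.csv is not shown in this thread, so the column names "city" and "region" in that file are assumptions, and both helper names are hypothetical.

```python
import pandas as pd


def attach_region(prices: pd.DataFrame, regions: pd.DataFrame) -> pd.DataFrame:
    """Rename the misleading 'region' column to 'city', then join the real region.

    Assumes `regions` has the columns 'city' and 'region' (not confirmed here).
    """
    prices = prices.rename(columns={"region": "city"})
    # how="left" keeps cities that have no match; their region becomes NaN.
    return prices.merge(regions[["city", "region"]], on="city", how="left")


def cities_per_region(prices_with_region: pd.DataFrame) -> pd.Series:
    """Count the distinct cities that report prices in each region."""
    return prices_with_region.groupby("region")["city"].nunique()
```

Counting cities per region this way would show whether "35-40 cities" are spread over many regions or concentrated in a few.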

"Besides, it seems that we do not have to worry about the subproduct dimension except for rice."

Our most important dataset is the daily one, not the weekly one. In the daily one, rice has 100 subproducts, onion has 26 subproducts, wheat has 68 subproducts, etc. So we do have to worry about both cities and subproducts. What we would like to do is the following:

let D be a mapping from all (R, P) to a time series
for each region R:
|    for each product P:
|    |   let M be a matrix.
|    |   for each subproduct SP (of product P):
|    |   |   for each city C (of region R):
|    |   |   |   let T be the time-series corresponding to (R,P,SP,C)
|    |   |   |   interpolate T to obtain a full time-series
|    |   |   |   add the vector T as a column to matrix M
|    |   let T = PCA(M, 1)
|    |   store a mapping from (R, P) to T into D
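
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the long-format columns "region", "product", "subproduct", "city", "date" and "price" are hypothetical names, and PCA(M, 1) is approximated with an SVD of the column-centered matrix. The resulting series is centered and its sign is arbitrary, so it is not in price units.

```python
import numpy as np
import pandas as pd


def first_principal_component(M: np.ndarray) -> np.ndarray:
    """Project the rows of M (dates x series) onto the first principal axis."""
    centered = M - M.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def merge_series(df: pd.DataFrame, dates: pd.DatetimeIndex) -> dict:
    """Return D, a mapping (region, product) -> one merged series over `dates`."""
    D = {}
    for (region, product), group in df.groupby(["region", "product"]):
        columns = []
        for _, sub in group.groupby(["subproduct", "city"]):
            # T: the series for (R, P, SP, C); same-date duplicates are averaged.
            T = sub.groupby("date")["price"].mean().reindex(dates)
            if not T.notna().any():
                continue  # nothing to interpolate from
            # Time-based interpolation; edge gaps take the nearest known value.
            T = T.interpolate(method="time", limit_direction="both")
            columns.append(T.to_numpy())
        if columns:
            M = np.column_stack(columns)
            D[(region, product)] = pd.Series(
                first_principal_component(M), index=dates
            )
    return D
```

Passing a daily or a weekly `dates` index is the only change needed to switch between the two datasets in this sketch.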

Try to write your code so that it works on both the daily and weekly datasets. The only differences between the datasets are the date range you have to pick and the gap between dates for interpolation (1 week vs. 1 day). I can help with implementing this after we meet. We first need to go over your code.

One extra difficulty for the weekly dataset is that you might have to account for prices reported in the same week but on different days (I don't know if this occurs in the data; you should check). If that is the case, then you should really interpolate on a week-of-the-year index rather than a date index.
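
If prices do turn out to be reported on different weekdays, one possible approach is to snap each observation to the Friday that ends its week before interpolating. The sketch below assumes a price series indexed by date; the function name is hypothetical, and averaging multiple reports within a week is an illustrative choice rather than something decided in this thread.

```python
import pandas as pd


def to_weekly_grid(series: pd.Series) -> pd.Series:
    """Align a date-indexed price series to week-ending Fridays and fill gaps."""
    # 'W-FRI' periods run Saturday through Friday; end_time falls on the Friday.
    week_end = series.index.to_period("W-FRI").end_time.normalize()
    # Several reports in the same week are averaged into one weekly value.
    weekly = series.groupby(week_end).mean()
    grid = pd.date_range(weekly.index.min(), weekly.index.max(), freq="W-FRI")
    # Missing weeks become NaN on the regular grid and are filled linearly.
    return weekly.reindex(grid).interpolate(method="linear")
```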

halccw commented 10 years ago

@mstefanro

Yes, 0.2 cutoff rate means choosing those series with at least 80% non-NaN data points before interpolation.
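
As a small illustration of that definition, assuming the data has been pivoted into a wide table with one row per date and one column per series (the function name and the layout are hypothetical):

```python
import pandas as pd


def usable_series(wide: pd.DataFrame, cutoff: float = 0.2) -> pd.DataFrame:
    """Keep only the columns whose fraction of NaN values is at most `cutoff`."""
    nan_rate = wide.isna().mean()  # per-column share of missing data points
    return wide.loc[:, nan_rate <= cutoff]
```

With cutoff=0.2, a series survives only if at least 80% of its data points are present before interpolation.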

We can easily group series in the same region by looping over a mapping such as region[0] = [city1, city2, ...]. I will add per-region stats tomorrow.

The final point you mentioned is fine. Prices are always reported on Fridays.

in: all_dates_raw = sorted(list(set(df['date'])))
in: all_dates = pd.date_range(all_dates_raw[0], all_dates_raw[-1], freq='W-FRI')
in: list(set(all_dates) - set(all_dates_raw))

out: 
[Timestamp('2007-03-02 00:00:00', tz=None),
 Timestamp('2007-03-09 00:00:00', tz=None),
 Timestamp('2007-03-16 00:00:00', tz=None),
 Timestamp('2007-03-23 00:00:00', tz=None),
 Timestamp('2007-03-30 00:00:00', tz=None),
 Timestamp('2007-04-06 00:00:00', tz=None),
 Timestamp('2007-04-13 00:00:00', tz=None),
 Timestamp('2007-04-20 00:00:00', tz=None),
 Timestamp('2007-04-27 00:00:00', tz=None),
 Timestamp('2007-05-04 00:00:00', tz=None)]
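
The set difference above lists the Fridays that are absent from the data. A complementary check, sketched here with a hypothetical helper, would also report any dates that do not fall on a Friday:

```python
import pandas as pd


def weekday_report(dates):
    """Return (dates that are not Fridays, Fridays missing from the range)."""
    idx = pd.DatetimeIndex(pd.to_datetime(dates)).unique().sort_values()
    off_friday = idx[idx.dayofweek != 4]  # Monday=0, ..., Friday=4
    fridays = pd.date_range(idx[0], idx[-1], freq="W-FRI")
    missing = fridays.difference(idx)
    return off_friday, missing
```

Calling weekday_report(df['date']) should return an empty first element if every price is indeed reported on a Friday.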
mstefanro commented 10 years ago

Thanks for the feedback. I don't think you have to redo the statistics; I merely wanted to let you know that in the end we're going to need at least one city in each region of interest.


f4bD3v commented 10 years ago

Among the series with an acceptable cutoff rate, we should select those for important commodities:

"Rice is the staple of the south, while bread => wheat is the staple of the north, of course with some cross over. Environmental conditions support this trend; with the largest rice growing in the south and wheat grown mainly in the north. Dal, which is Hindi for lentil, is eaten all over."

"Common vegetables used in cooking; potato, onion, okra, green beans, peas, cauliflower, capsicum, carrot (which are red), mushrooms, eggplant, chilli."

"Available fruits include apples, oranges, mandarins (which they call oranges), bananas, mango and pineapple."

source: http://www.thetravelalmanac.com/india/indian-food.htm

In this PDF, Groundnut Oil and Peanut Oil are said to be the most used oils in India: http://www.umbrellaindia.com/Different-types-oils.pdf