f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Potential of daily wholesale data (2005-2014) for prediction vs. daily retail (2009-2013) #26

Closed: halccw closed this issue 10 years ago

halccw commented 10 years ago

Before I dig into prediction, I would like to share and discuss some thoughts.

We have wholesale daily (2005-2014) and retail daily (2009-2013) datasets.

1. Include a few very good wholesale daily series into prediction goals

The wholesale daily dataset is sparse, but it contains some very good series with 80% to 90% or more valid data over a span of about 10 years, and these series also appear very volatile and periodic. Although they are only a tiny portion of the whole picture, I suggest we still make good use of them to produce individual predictions.
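As a rough illustration of what "valid data" means here, the sketch below computes the fraction of non-missing entries in a series and keeps only the series above a threshold. The function names and the toy prices are hypothetical, and plain Python is used only so the example runs without dependencies; it is not the code used in the repository.

```python
import math


def valid_data_rate(prices):
    """Fraction of entries that are real observations (not None or NaN)."""
    if not prices:
        return 0.0
    valid = sum(1 for p in prices if p is not None and not math.isnan(p))
    return valid / len(prices)


def select_series(series_by_key, min_rate=0.8):
    """Keep only the series whose valid-data rate reaches min_rate."""
    return {
        key: prices
        for key, prices in series_by_key.items()
        if valid_data_rate(prices) >= min_rate
    }


# Hypothetical toy data: one price per calendar day, None where missing.
toy = {
    ("Uttar Pradesh", "Onion"): [10.0, 11.0, None, 12.5, 13.0, 12.0, 11.5, 11.0, 10.5, 10.0],
    ("Gujarat", "Potato"): [8.0, None, None, None, 9.0, None, None, 9.5, None, None],
}
good = select_series(toy, min_rate=0.8)
```

With this toy input only the onion series passes the 80% threshold; the potato series has too many gaps.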

Pre-interpolation graphs per region (zoom in or click it to see clearer graphs):

Uttar Pradesh: Apple and onion appear volatile and periodic, but we should discard the rice here, since its price is very stable. (Figure 1)

West Bengal: Observe the periodic clustering of high volatility. (Figure 2)

Gujarat: Super volatile potato. (Figure 3)

NCT of Delhi: Wheat price. (Figure 4)

Some more to come tomorrow.

mstefanro commented 10 years ago

Do you think we should just pick the K best time-series and attempt to predict those? This will save us from having to care about subproducts or cities, since we are simply predicting "things for which there is data", rather than trying to uniformly predict the same things for each region.
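A minimal sketch of the "pick the K best series" idea, assuming "best" means highest valid-data rate. The helper name and all rates except the (Andhra Pradesh, Chirala, Rice, B P T, 0.71) example quoted later in this thread are made up for illustration.

```python
def top_k_series(rates, k):
    """Return the k series keys with the highest valid-data rate."""
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return [key for key, _rate in ranked[:k]]


# Keys are (state, city, product, subproduct). Only the first entry comes from
# the thread; the other cities, subproducts and rates are hypothetical.
rates = {
    ("Andhra Pradesh", "Chirala", "Rice", "B P T"): 0.71,
    ("Uttar Pradesh", "city A", "Onion", "sub A"): 0.92,
    ("Gujarat", "city B", "Potato", "sub B"): 0.85,
    ("West Bengal", "city C", "Rice", "sub C"): 0.64,
}
best_two = top_k_series(rates, 2)
```

Ranking by data availability alone ignores how informative a series is, so a real selection would probably combine it with a volatility or periodicity criterion.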

On 05/09/2014 01:07 AM, chingchia wrote:

    Pre-interpolation graphs per region:

    1. Uttar Pradesh: https://cloud.githubusercontent.com/assets/4166714/2922388/f20e8120-d700-11e3-9566-91cf3018b245.png
    2. West Bengal: https://cloud.githubusercontent.com/assets/4166714/2922391/f23a6f9c-d700-11e3-89e2-944bd69fcde7.png
    3. Gujarat: https://cloud.githubusercontent.com/assets/4166714/2922390/f2379506-d700-11e3-876e-54fc72c228f7.png
    4. NCT of Delhi: https://cloud.githubusercontent.com/assets/4166714/2922389/f233e910-d700-11e3-8bd9-d1220570c72c.png

f4bD3v commented 10 years ago

@mstefanro, I would say that is the way to go, given the time constraints and the quality of the data. We can choose specific series and additionally try to feed in prices from neighbouring regions, social media indicators and weather data with a researched offset.

@chingchia Are there more of these series in the wholesale data? Should we help check the data to filter out good series, or is the number very limited? What do you think of plotting the daily or weekly retail data for the same commodities and trying to infer regions after interpolating the good wholesale series with a cubic spline?

For Delhi, is the time series also potato?

Considering frequency of consumption, potato, onion and apple are good choices. It would also be nice to find good series for rice, wheat and lentils.
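To make the cubic spline suggestion concrete, here is a dependency-free sketch that fits a natural cubic spline through the observed days of a series and fills only the interior gaps. The natural boundary condition, the day-index x-axis and the function names are assumptions for illustration; a library routine would normally be used instead.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least two points.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Solve the tridiagonal system for the second-derivative coefficients c.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the segment containing x, clamped to the valid range.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


def fill_gaps(prices):
    """Fill interior None gaps with spline values; edge gaps stay None."""
    known = [(i, p) for i, p in enumerate(prices) if p is not None]
    if len(known) < 3:
        return list(prices)
    xs = [i for i, _p in known]
    ys = [p for _i, p in known]
    spline = natural_cubic_spline(xs, ys)
    return [
        p if p is not None else (spline(i) if xs[0] < i < xs[-1] else None)
        for i, p in enumerate(prices)
    ]


filled = fill_gaps([None, 1.0, None, 3.0, None, 5.0, None])
```

Leading and trailing gaps are left untouched because a spline extrapolates poorly outside the observed range.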

halccw commented 10 years ago

@fabbrix For the wholesale daily dataset, unfortunately these are the only series that have >80% valid data. For the NCT of Delhi graph, it's wheat.

Please check this table to get a sense of the data availability of the wholesale dataset. Each cell holds the best valid-data rate for a (product, subproduct) pair in a region (note that 0.9 = 90%). An empty cell means that there is no series with more than 60% valid data. From the table, we have a few good individual series for rice and wheat (shown in the previous graphs), and some decent ones with about 70% to 80% valid data.
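The table described above could be built roughly as follows: for each (region, product, subproduct) cell, keep the highest valid-data rate among its series and leave the cell out when nothing exceeds the cutoff. The function name and most of the toy rates are hypothetical, and this is not the script that produced the linked CSV files.

```python
def best_rate_table(series_rates, cutoff=0.6):
    """Best valid-data rate per (region, product, subproduct) cell.

    series_rates maps (region, city, product, subproduct) to a rate in [0, 1].
    Cells whose best rate does not exceed the cutoff are omitted (empty).
    """
    best = {}
    for (region, _city, product, subproduct), rate in series_rates.items():
        cell = (region, product, subproduct)
        best[cell] = max(rate, best.get(cell, 0.0))
    return {cell: rate for cell, rate in best.items() if rate > cutoff}


# Toy input; only the Chirala entry appears in the thread, the rest is made up.
series_rates = {
    ("Andhra Pradesh", "Chirala", "Rice", "B P T"): 0.71,
    ("Andhra Pradesh", "city X", "Rice", "B P T"): 0.65,
    ("Gujarat", "city Y", "Wheat", "sub W"): 0.45,
}
table = best_rate_table(series_rates)
```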

A table summarizing data availability of wholesale daily: https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.4.csv

The same table for retail daily: https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-retail/best_non_na_0.4.csv

I will add some more graphs of the retail daily later.

f4bD3v commented 10 years ago

@chingchia Approach to build near-complete series:

halccw commented 10 years ago

wholesale daily regional plots

https://www.dropbox.com/s/25vd8fg5cznqqap/wholesale_daily_regional_plots_0.6.zip

  1. The cutoff rate is lowered to 60%.
  2. 11 states in total: 'Andhra Pradesh', 'Gujarat', 'Jharkhand', 'Karnataka', 'Madhya Pradesh', 'Maharashtra', 'NCT of Delhi', 'Orissa', 'Punjab', 'Uttar Pradesh', 'West Bengal'
  3. 5 products included: 'Rice', 'Wheat', 'Apple', 'Potato', 'Onion'
  4. Across all regions and all products, 24 plots are generated.
  5. Format of legend: (state, city, product, subproduct, valid-data-rate), e.g. (Andhra Pradesh, Chirala, Rice, B P T, 0.71)

halccw commented 10 years ago

Wholesale daily product plots

  1. The cutoff rate is 60%.
  2. For each product, I select the series with the highest valid-data rate in each region to keep the graph readable.
  3. The format of the legend is the same as above: (state, city, product, subproduct, valid-data-rate), e.g. (Andhra Pradesh, Chirala, Rice, B P T, 0.71)

Rice

figure_1

Wheat

figure_2

Apple

figure_3

Potato

figure_4

Onion

figure_5

f4bD3v commented 10 years ago

Usability review of selected wholesale series:

  1. Maharashtra: Onion
  2. NCT of Delhi: Potato, Wheat x2
  3. Orissa: Wheat
  4. Uttar Pradesh: Apple, Onion (merge all series?), Potato (merge series by averaging + noise?), Rice coarse vs. Rice fine?, Wheat (try merging)
  5. Gujarat (can't exactly make out series for subs): Wheat
  6. Jharkhand (Ranchi): Fine Rice
  7. West Bengal: Potato (all series match well, so we could build a Gaussian process out of them), Rice fine

halccw commented 10 years ago

The complete bundle of plots and tables for wholesale and retail, daily and weekly

including:

Datasets:

  1. Wholesale daily
  2. Retail daily
  3. Wholesale weekly (downsampled from daily)
  4. Retail weekly (downsampled from daily)
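The thread does not say how the weekly series were downsampled from the daily ones. One plausible version, assuming each week is the mean of its valid daily prices and weeks start on Monday, looks like this (names and toy prices are hypothetical):

```python
from datetime import date, timedelta
from statistics import mean


def downsample_weekly(daily):
    """Map {date: price or None} to {Monday of that week: mean valid price or None}."""
    buckets = {}
    for day, price in daily.items():
        monday = day - timedelta(days=day.weekday())
        values = buckets.setdefault(monday, [])
        if price is not None:
            values.append(price)
    return {
        monday: (mean(values) if values else None)
        for monday, values in sorted(buckets.items())
    }


# Two toy weeks starting Monday 2014-05-05; the second week has no valid data.
start = date(2014, 5, 5)
week_one = [10.0, None, 12.0, None, None, 14.0, None]
week_two = [None] * 7
daily = {start + timedelta(days=i): p for i, p in enumerate(week_one + week_two)}
weekly = downsample_weekly(daily)
```

A week with no valid observation stays missing rather than being filled, so the valid-data rate of the weekly series remains meaningful.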

Selected products = ['Rice', 'Wheat', 'Apple', 'Potato', 'Onion']

  1. 'per_region/': one plot for each selected product for each region
  2. 'per_products/': one plot for each product
  3. 'per_products_regional_best/': one plot for each product, showing the one best series per region
  4. 'num_series.csv': counts of series above the cutoff rate per region
  5. 'best_non_na.csv': best valid-data-rate per region

link: https://dl.dropboxusercontent.com/u/29566584/wholesale_retail_daily_weekly.zip

f4bD3v commented 10 years ago

Daily retail:

  1. Maharashtra: Onion
  2. NCT of Delhi: Onion, Potato

The regional best plot of onion seems to show a general country-wide pattern, while the standard deviation for rice and wheat stays more or less stable over the period as prices increase (we could try to match this to inflation). But maybe we are introducing a bias by selecting the regional best.

Weekly:

  1. Karnataka: Potato (looks very nice)
  2. Maharashtra: Onion, Potato, Rice, Wheat
  3. NCT of Delhi: Onion, Potato
  4. Orissa: Onion, Potato
  5. Rajasthan: Onion
  6. Tamil Nadu: Potato
  7. Uttar Pradesh: Potato
  8. West Bengal: Onion, Potato, Rice

For most of the others, bad data collection is evident.

The price per product plots are very nice. For the weekly data they show that the onion price is very volatile but consistent across regions, while the prices of rice and wheat are less volatile but vary greatly across regions. Potato is also very volatile, with some differences between regions. Inflation also seems to manifest itself more in the price of rice and wheat than in the price of potato and onion.

To motivate the choice of granularity empirically: could we run a time series analysis of volatility at each granularity? Compute the average price per region with its standard deviation, then compute the average national price with its standard deviation. How should we proceed to compute average national prices? By region?
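One possible answer to the "by region?" question is sketched below: compute mean and standard deviation per region first, then take the unweighted mean and standard deviation of the regional means. Giving each region equal weight is an assumption (weighting by population or traded volume would be an alternative), and the prices are toy values.

```python
from statistics import mean, stdev


def regional_stats(prices_by_region):
    """Mean and sample standard deviation of the valid prices in each region."""
    stats = {}
    for region, prices in prices_by_region.items():
        valid = [p for p in prices if p is not None]
        stats[region] = (mean(valid), stdev(valid))
    return stats


def national_stats(stats):
    """Unweighted mean and standard deviation of the regional means."""
    region_means = [m for m, _sd in stats.values()]
    return mean(region_means), stdev(region_means)


# Hypothetical prices for one product in two regions.
prices_by_region = {
    "Region A": [10.0, 12.0, None, 14.0],
    "Region B": [20.0, 20.0, 20.0, None],
}
per_region = regional_stats(prices_by_region)
national_mean, national_sd = national_stats(per_region)
```

The regional standard deviation here measures volatility over time, while the national one measures spread between regions, which matches the two effects discussed for onion versus rice and wheat.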

For the network, I think it is not too important exactly which offset we use when feeding in the weather data, because the reservoir has a memory property.

halccw commented 10 years ago

After looking at every series with a valid-data rate above 60% in the three datasets, I realized that:

  1. Retail weekly is useless for now. Even after a simple spike-removal heuristic, there are too many spikes left.
  2. Retail daily is useless when we have the wholesale daily dataset. Many of its series look strange. The usable ones have patterns very similar to those in wholesale daily. Its time span is also much shorter than that of wholesale daily.
  3. Wholesale daily is the most useful one. Its series show good dynamics with fewer spikes.
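The spike-removal heuristic mentioned in the list above is not described in the thread. As a stand-in, the sketch below marks a point as a spike when it deviates from the median of its neighbours by more than a relative threshold; the window size, the threshold and the function name are all assumptions.

```python
from statistics import median


def remove_spikes(prices, window=3, max_rel_dev=0.5):
    """Replace suspected spikes with None.

    A point is a suspected spike if it differs from the median of the valid
    points within `window` positions on either side by more than
    `max_rel_dev` times that median.
    """
    cleaned = list(prices)
    for i, price in enumerate(prices):
        if price is None:
            continue
        lo = max(0, i - window)
        hi = min(len(prices), i + window + 1)
        neighbours = [
            prices[j] for j in range(lo, hi) if j != i and prices[j] is not None
        ]
        if not neighbours:
            continue
        reference = median(neighbours)
        if reference > 0 and abs(price - reference) / reference > max_rel_dev:
            cleaned[i] = None
    return cleaned


cleaned = remove_spikes([10.0, 11.0, 100.0, 12.0, None, 11.0, 10.0])
```

Using the median rather than the mean keeps a single outlier from distorting the reference level of its neighbours.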

2. In Wholesale daily, merge series to construct an extended dataset with a more uniform profile over products and regions

Merge series with more than 60% of valid data of the same product within each region by averaging, to get (Uttar Pradesh, Potato), (West Bengal, Rice), etc.
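A minimal sketch of this merging step, assuming the series are aligned day by day and that each merged value is the mean of whatever valid prices exist on that day. The helper names, cities, subproducts and prices are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def _valid_rate(prices):
    """Fraction of non-missing entries in a series."""
    return sum(p is not None for p in prices) / len(prices)


def _mean_valid(values):
    """Mean of the non-missing values, or None if all are missing."""
    valid = [v for v in values if v is not None]
    return mean(valid) if valid else None


def merge_by_region_product(series, cutoff=0.6):
    """Average all series above the cutoff that share a (region, product).

    series maps (region, city, product, subproduct) to equally long lists
    holding one price (or None) per day.
    """
    groups = defaultdict(list)
    for (region, _city, product, _sub), prices in series.items():
        if _valid_rate(prices) > cutoff:
            groups[(region, product)].append(prices)
    return {
        key: [_mean_valid(day) for day in zip(*members)]
        for key, members in groups.items()
    }


series = {
    ("West Bengal", "city A", "Rice", "Fine"): [20.0, 22.0, None, 24.0, 26.0],
    ("West Bengal", "city B", "Rice", "Common"): [18.0, None, 21.0, 22.0, 24.0],
    ("West Bengal", "city C", "Rice", "Coarse"): [None, None, None, 15.0, None],
}
merged = merge_by_region_product(series)
```

On days where only one member series has data, the merged value simply follows that series, which can introduce level jumps when subproducts differ in price; that is worth checking on the plots.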

Plots per region, with legend = (region, product):

Andhra Pradesh, Gujarat, Haryana, Jharkhand, Maharashtra, NCT of Delhi, Orissa, Punjab, Uttar Pradesh, West Bengal