f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Situation of the wholesale daily dataset #22

Closed: halccw closed this issue 10 years ago

halccw commented 10 years ago

The wholesale daily dataset is very sparse.

If we set the valid-threshold to 70% (meaning that we keep only series with at least 70% non-NaN data), we get only 15 (product, subproduct) pairs, and most regions have data for only a few of them.
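A minimal sketch of this kind of threshold filter, assuming the prices sit in a wide pandas DataFrame with one column per series and a daily date index. The function name `filter_by_valid_fraction` and the toy data are hypothetical and are not taken from the repository.

```python
import numpy as np
import pandas as pd


def filter_by_valid_fraction(prices, threshold=0.7):
    """Keep columns whose share of non-NaN values is at least `threshold`."""
    valid_frac = prices.notna().mean()  # per-column fraction of non-NaN cells
    return prices.loc[:, valid_frac >= threshold]


# Toy data (hypothetical): 10 days, one series 80% valid, one 50% valid.
days = pd.date_range("2014-01-01", periods=10, freq="D")
toy = pd.DataFrame(
    {
        "dense_series": [1.0] * 8 + [np.nan] * 2,
        "sparse_series": [1.0, np.nan] * 5,
    },
    index=days,
)

kept_70 = list(filter_by_valid_fraction(toy, 0.7).columns)
kept_50 = list(filter_by_valid_fraction(toy, 0.5).columns)
```

With a 70% threshold only the denser toy series survives; lowering the threshold to 50% keeps both.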

See the following 2 tables:

num_cities: Each cell is the number of cities in that region that have at least 70% valid data: https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_0.3.csv

best_non_na: Each cell is the highest valid-data percentage among the cities in that region: https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.3.csv
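For illustration, two tables of this shape could be computed roughly as follows. This is a sketch under assumptions, not the script that produced the CSV files: it assumes a long-format DataFrame with hypothetical columns `date`, `region`, `city`, `product`, `subproduct` and `price`, and it measures validity against every calendar day between the first and last date in the data.

```python
import pandas as pd


def sparsity_tables(long_df, threshold=0.7):
    """Return (num_cities, best_non_na), indexed by (product, subproduct) x region."""
    # Number of calendar days in the observed span (assumed denominator).
    n_days = len(pd.date_range(long_df["date"].min(), long_df["date"].max(), freq="D"))

    # Fraction of days with a non-NaN price, per city-level series.
    keys = ["product", "subproduct", "region", "city"]
    valid_frac = long_df.dropna(subset=["price"]).groupby(keys)["date"].nunique() / n_days
    per_city = valid_frac.reset_index(name="valid_frac")
    per_city["valid"] = per_city["valid_frac"] >= threshold

    grouped = per_city.groupby(["product", "subproduct", "region"])
    num_cities = grouped["valid"].sum().unstack("region", fill_value=0)
    best_non_na = grouped["valid_frac"].max().unstack("region")
    return num_cities, best_non_na


# Toy data (hypothetical): three cities observed on 8, 5 and 10 of 10 days.
days = pd.date_range("2014-01-01", periods=10, freq="D")


def make_rows(region, city, n_valid):
    return [
        {"date": d, "region": region, "city": city,
         "product": "Rice", "subproduct": "Fine", "price": 20.0}
        for d in days[:n_valid]
    ]


toy = pd.DataFrame(
    make_rows("North", "A", 8) + make_rows("North", "B", 5) + make_rows("South", "C", 10)
)
num_cities, best_non_na = sparsity_tables(toy, threshold=0.7)
```

In the toy data, only one of the two cities in the North region clears the 70% threshold, so its num_cities cell is 1 and its best_non_na cell is 0.8.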

Even if I reduce the valid-threshold to 60%, the data is still sparse. https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_0.4.csv https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.4.csv

Restricting the time period to 3 years (2011-2014) to reduce sparsity did not help much. The result looks very similar to the one for the whole time span. https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_3y_0.4.csv
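A sketch of how the two time spans could be compared, again assuming a wide DataFrame with a date index. The helper `valid_fraction` and the toy values are hypothetical. When the gaps are spread evenly over the years, the valid fraction in a shorter window stays about the same, which matches the observation above.

```python
import numpy as np
import pandas as pd


def valid_fraction(prices, start=None, end=None):
    """Per-column fraction of non-NaN values within [start, end] (inclusive)."""
    window = prices.sort_index().loc[start:end]  # label-based date slicing
    return window.notna().mean()


# Toy data (hypothetical): gaps spread evenly across the years.
toy = pd.DataFrame(
    {"series": [1.0, np.nan, np.nan, 2.0]},
    index=pd.to_datetime(["2009-06-01", "2010-06-01", "2012-06-01", "2013-06-01"]),
)

full_span = valid_fraction(toy)["series"]
recent = valid_fraction(toy, "2011", "2014")["series"]
```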

halccw commented 10 years ago

Note that this dataset covers 2005 to March 2014.