f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Some issues of the daily datasets #20

Closed by halccw 10 years ago

halccw commented 10 years ago

Recording some issues that still need to be solved.

  1. Missing cities: Across the current 7 daily datasets (Rice, Wheat, Onion...), there are 1308 cities (or towns or markets) that are not covered by regions.csv. I have not found an efficient way to solve this. For the whole list, please check: https://github.com/fabbrix/humanitas/blob/master/data/india/csv_daily/agmarknet.nic.in/missing_cities_daily.csv
  2. Duplicate dates and abnormal spikes problem

Rice

figure_1

figure_2
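For reference, the missing-cities check from item 1 can be sketched as a set difference between the city names in a daily dataset and those in regions.csv. This is only a minimal sketch: the `city` column name, the case-insensitive matching, and the sample rows are assumptions for illustration, not the actual schema of the project's CSV files.

```python
import csv
import io


def missing_cities(daily_rows, region_rows, city_field="city"):
    """Return sorted city names found in daily_rows but absent from region_rows.

    Matching ignores case and surrounding whitespace (an assumption).
    """
    known = {row[city_field].strip().lower() for row in region_rows}
    missing = {
        row[city_field].strip()
        for row in daily_rows
        if row[city_field].strip().lower() not in known
    }
    return sorted(missing)


# Made-up samples standing in for one daily dataset and regions.csv.
daily_csv = io.StringIO(
    "date,city,price\n"
    "2010-01-28,TownA,20\n"
    "2010-01-28,TownB,18\n"
)
regions_csv = io.StringIO("city,region\ntowna,RegionX\n")

daily = list(csv.DictReader(daily_csv))
regions = list(csv.DictReader(regions_csv))
print(missing_cities(daily, regions))
```

Running the same function over each of the daily datasets and merging the results would give one combined list of uncovered markets.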

mstefanro commented 10 years ago

Do duplicate dates exist in the daily data as well?

On 04/28/2014 07:05 PM, chingchia wrote:

Duplicate dates and abnormal spikes problem

Rice

figure_1 https://cloud.githubusercontent.com/assets/4166714/2820142/af3b8d78-cef6-11e3-8ff8-c916a7bd3eb1.png

figure_2 https://cloud.githubusercontent.com/assets/4166714/2820141/af3b28c4-cef6-11e3-9e7c-fbfb7ee159aa.png


halccw commented 10 years ago

Yes, similar to the weekly ones.

The following is from daily Rice:

https://github.com/fabbrix/humanitas/blob/master/analysis/preproc/dup_daily.txt

tonyo commented 10 years ago

Duplicates arise from weird tabular data. See, for example, http://agmarknet.nic.in/cmm2_home.asp?comm=Rice&dt=28/01/2010, for Gajapathinagaram. There are two rows with empty subproducts, but their data is duplicated from the previous rows. I made some additional checks, and it looks like all rows with a missing subproduct are redundant. @chingchia Could you please try ignoring rows with an empty subproduct field and see what happens?
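The suggested check could look roughly like the sketch below: drop rows whose subproduct is empty, then list the keys that still occur more than once. The field names (`city`, `date`, `subproduct`) and the choice of `(city, date)` as the duplicate key are assumptions for illustration, and the rows are made up.

```python
from collections import Counter


def drop_empty_subproduct(rows, field="subproduct"):
    """Keep only rows whose subproduct field is non-empty."""
    return [r for r in rows if (r.get(field) or "").strip()]


def duplicated_keys(rows, key_fields=("city", "date")):
    """Return the sorted keys that occur in more than one row."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Made-up rows: the second one repeats the first with an empty subproduct.
rows = [
    {"city": "TownA", "date": "2010-01-28", "subproduct": "Fine", "price": "20"},
    {"city": "TownA", "date": "2010-01-28", "subproduct": "", "price": "20"},
    {"city": "TownB", "date": "2010-01-28", "subproduct": "Fine", "price": "18"},
]

print(duplicated_keys(rows))
print(duplicated_keys(drop_empty_subproduct(rows)))
```

If a market legitimately reports several subproducts on the same date, the subproduct would need to be part of the key as well.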

halccw commented 10 years ago

This is the duplication record after excluding rows with an empty subproduct (daily Rice):

https://github.com/fabbrix/humanitas/blob/master/analysis/preproc/dup_daily_rice_exclude_empty_subproduct.txt

The result looks good: each remaining duplicate is a pair of rows with the same date and the same price. I can eliminate them by keeping only one row of each pair (the one with non-zero tonnes).
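A minimal sketch of that elimination step, assuming rows are dicts keyed by hypothetical `city`, `date`, and `tonnes` fields (the real column names may differ): for each key, keep the first row seen, and swap it for a later duplicate only when the kept row has zero tonnes and the later one does not.

```python
def dedupe_prefer_nonzero_tonnes(rows, key_fields=("city", "date"),
                                 tonnes_field="tonnes"):
    """Keep one row per key, preferring a row with non-zero tonnes."""
    kept = {}  # dicts preserve insertion order, so output order is stable
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        current = kept.get(key)
        if current is None:
            kept[key] = row
            continue
        current_tonnes = float(current[tonnes_field] or 0)
        new_tonnes = float(row[tonnes_field] or 0)
        if current_tonnes == 0 and new_tonnes != 0:
            kept[key] = row  # replace the zero-tonnes duplicate
    return list(kept.values())


# Made-up rows: the first two share a date and a price.
rows = [
    {"city": "TownA", "date": "2010-01-28", "price": "20", "tonnes": "0"},
    {"city": "TownA", "date": "2010-01-28", "price": "20", "tonnes": "5"},
    {"city": "TownA", "date": "2010-01-29", "price": "21", "tonnes": "3"},
]

cleaned = dedupe_prefer_nonzero_tonnes(rows)
print(cleaned)
```

When both rows of a pair have non-zero tonnes, this sketch simply keeps the first; since the prices are identical, either choice gives the same price series.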