f4bD3v / humanitas

A price prediction toolset for developing countries
BSD 3-Clause "New" or "Revised" License

Convert all tabular data into CSV files #9

Closed mstefanro closed 10 years ago

mstefanro commented 10 years ago

Ideally, all the price data should be tabular, with the following columns: date, averaging-period (week/month), country, region, product, subproduct (optional), price.

Example of a row: 02/01/2000,week,Indonesia,Banda Aceh,milk,condensed,4000.0

tonyo commented 10 years ago

I'll add a script that traverses directories and converts all .XLS files to CSV.

mstefanro commented 10 years ago

Great, but this may need some manual work. Ideally it should be easy to programmatically get answers to queries of the type "give me the time series of region A for product B", where A or B can also be "ALL". As such, we should have a table with the following columns: date, averaging-period, country, region, product, subproduct, price.

Those spreadsheet files have table headers spanning multiple rows, and multiple sheets per file. So just running some xls2csv is not going to be enough to, e.g., load the data in R and work with it.
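For what it's worth, once the data is in that layout, a query of that shape could be a few lines of Python. A minimal sketch, assuming the seven-column layout from the example row; `time_series` is a hypothetical helper, not existing project code:

```python
import csv
from io import StringIO

COLUMNS = ["date", "period", "country", "region", "product", "subproduct", "price"]

def time_series(rows, region="ALL", product="ALL"):
    """Return (date, price) pairs; "ALL" acts as a wildcard for region/product."""
    return [(r["date"], float(r["price"]))
            for r in rows
            if region in ("ALL", r["region"]) and product in ("ALL", r["product"])]

# Sample data in the agreed row format
sample = """02/01/2000,week,Indonesia,Banda Aceh,milk,condensed,4000.0
09/01/2000,week,Indonesia,Banda Aceh,milk,condensed,4100.0
02/01/2000,week,Indonesia,Jakarta,rice,common,2500.0
"""
rows = [dict(zip(COLUMNS, rec)) for rec in csv.reader(StringIO(sample))]
print(time_series(rows, region="Banda Aceh", product="milk"))
```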


halccw commented 10 years ago

I tried to align everything manually, but in vain. The problem is that, for each product, the set of regions and states that provide data differs from year to year.

tonyo commented 10 years ago

I think we should first focus on a couple of specific products (like rice, chicken, bread) and a few specific regions. What's the point of our system if it doesn't work for at least a subset?

Anyway, I added xls_to_csv.py to data_crunching/common, which converts all .xls files it can find under the specified directory. I decided not to push all the .csv files to the repo to avoid a mess, but if nobody sees a problem with it, feel free to commit them as well.
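The actual xls_to_csv.py isn't reproduced here, but the traversal part could look like this sketch. `find_xls_files` and `csv_path_for` are illustrative names, not the script's real API, and the xlrd call shown in the comment is one possible way to do the sheet conversion itself:

```python
import os

def csv_path_for(xls_path):
    """Map foo/bar.xls -> foo/bar.csv (hypothetical naming scheme)."""
    root, _ext = os.path.splitext(xls_path)
    return root + ".csv"

def find_xls_files(top):
    """Yield every .xls file found anywhere under `top`."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.lower().endswith(".xls"):
                yield os.path.join(dirpath, name)

# The per-file conversion could use a library such as xlrd, roughly:
#   book = xlrd.open_workbook(path)
#   for sheet in book.sheets():
#       write each sheet row out with csv.writer(...)
```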

mstefanro commented 10 years ago

Anton, an interesting thing to try is to learn them all at once (I have a neural network model in mind). The system may work better at predicting all of them than predicting a subset, because they might be correlated.


mstefanro commented 10 years ago

@chingchia

There is no way we are going to attempt to implement neural networks in VBA :D

I don't see why it makes a difference if the regions providing the data are not the same. The cells where the data is not provided are already filled with a dash or something. Can't you just run some xls2csv tool on each sheet, then open the CSVs in Python, reorganize the data in memory, and then dump it into a separate CSV? I'd expect this to be ~40 lines of Python code.
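A sketch of that reorganization step, assuming each sheet comes out of xls2csv as a region-by-date grid with "-" marking missing cells (the exact sheet layout is an assumption, and `reorganize` is a hypothetical helper):

```python
import csv
from io import StringIO

def reorganize(raw_csv, country, period, product):
    """Turn a sheet-style grid (regions as rows, dates as columns, '-' for
    missing values) into normalized rows: date,period,country,region,product,price."""
    reader = csv.reader(StringIO(raw_csv))
    header = next(reader)      # first cell blank, remaining cells are dates
    dates = header[1:]
    out = []
    for row in reader:
        region, prices = row[0], row[1:]
        for date, price in zip(dates, prices):
            if price.strip() in ("", "-"):   # skip missing cells
                continue
            out.append([date, period, country, region, product, price])
    return out

sheet = ",02/01/2000,09/01/2000\nBanda Aceh,4000.0,-\nJakarta,3900.0,3950.0\n"
for rec in reorganize(sheet, "Indonesia", "week", "milk"):
    print(",".join(rec))
```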

On 04/03/2014 09:44 PM, chingchia wrote:

I tried to align everything manually but in vain. The problem is that for each product, the number of regions that provide data in each year is different.

Other than R, considering the complexity of the data format, VBA for Excel (visual basic for application) might be useful. Anyone used it before?

Differentiating data for different regions can still be done by hand, very painstaking but not infeasible...


halccw commented 10 years ago

@mstefanro

Yes, I agree. I actually meant using VBA to do the job you described in Python, but VBA seems like a bad idea.

And yes, in some cases missing data is filled with dashes. But if you look more closely at the original dataset (not the one I uploaded today), you will find that, for each product, the number and order of cities under each state differ from year to year.

Beyond that, the subcategories of products also differ across years. For example, rice has prices for "common" and "fine" in one year, but for "common", "fine", and "superfine" in another, so combining the two years makes it harder to produce a correctly aligned version.
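That alignment could still be automated by taking the union of subcategories across years and leaving blanks where a year has no data. A sketch with a hypothetical `align_years` helper and made-up prices:

```python
def align_years(tables):
    """tables: {year: {subproduct: price}}. Returns a header row plus one row
    per year, with '' where a subproduct has no data in that year."""
    subs = sorted({s for t in tables.values() for s in t})
    rows = [["year"] + subs]
    for year in sorted(tables):
        rows.append([str(year)] + [str(tables[year].get(s, "")) for s in subs])
    return rows

# "superfine" only exists in the second year, as in the rice example above
tables = {
    2009: {"common": 20.0, "fine": 25.0},
    2010: {"common": 22.0, "fine": 27.0, "superfine": 31.0},
}
for row in align_years(tables):
    print(",".join(row))
```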

I also agree with Anton's suggestion to hand-pick some significant products in important states or cities for preliminary analysis.

tonyo commented 10 years ago

A small update: I have added one column to daily CSV files that describes tonnage for the given product and region. So now CSV files have 8 columns:

date, averaging-period (week/month), country (India/Indonesia), region (optional), product (e.g. milk), subproduct (e.g. condensed; optional), price, tonnage.

Example of a row: 01/01/2005,day,India,Biharsharif,Rice,Fine,1000,5.0

Also, CSV files will have column names as their first row to avoid confusion. I'll upload/update daily data for several products tonight.
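With the column names in the first row, writing such a file is straightforward with Python's csv module. A sketch only; the exact header spellings below are assumptions based on the column list above:

```python
import csv
from io import StringIO

FIELDS = ["date", "averaging-period", "country", "region",
          "product", "subproduct", "price", "tonnage"]

buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()   # column names as the first row, to avoid confusion
writer.writerow({"date": "01/01/2005", "averaging-period": "day",
                 "country": "India", "region": "Biharsharif",
                 "product": "Rice", "subproduct": "Fine",
                 "price": "1000", "tonnage": "5.0"})
print(buf.getvalue())
```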

mstefanro commented 10 years ago

Quick question: in the case of daily data, weren't we supposed to have ranges of prices, as opposed to exact values?


tonyo commented 10 years ago

Yes, there were ranges, but I used modal values (the last column). I believe ranges will only complicate implementation, but in any case min and max are present in raw data, which I also store.

mstefanro commented 10 years ago

It's alright this way, for now (but keep the raw data).

Do we only have daily data for rice?


tonyo commented 10 years ago

Also for onion and wheat. We should choose several important products (ten will suffice, I think). Downloading all products would be a waste of time: there are too many of them, and for most of them the data is pretty scarce. Here's the product list: http://agmarknet.nic.in/cmmlist_today.asp?dt=14/04/2013

What about taking, say, apple, banana, carrot, potato, paddy (Dhan)? Unfortunately, there's no meat or bread there.

mstefanro commented 10 years ago

You can see the products for which we have the best weekly data in analysis/statistics/india-weekly-retail/products.txt. Maybe we should also pick the products with the most weekly data.
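Ranking products by how much weekly data they have is a one-liner with collections.Counter once the rows are loaded. A sketch with made-up records; `top_products` is a hypothetical helper, not something in the repo:

```python
from collections import Counter

def top_products(rows, n=10):
    """Rank products by the number of weekly observations they have."""
    counts = Counter(r["product"] for r in rows if r["period"] == "week")
    return [product for product, _count in counts.most_common(n)]

# Illustrative records in the CSV layout discussed above (only the
# fields needed for the ranking are shown)
rows = [
    {"product": "rice", "period": "week"},
    {"product": "rice", "period": "week"},
    {"product": "onion", "period": "week"},
    {"product": "wheat", "period": "day"},
]
print(top_products(rows, n=2))
```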
