epogrebnyak / data-rosstat-kep

Time series dataset of Rosstat Short-term Economic Indicators ("KEP") publication
http://www.gks.ru/wps/wcm/connect/rosstat_main/rosstat/ru/statistics/publications/catalog/doc_1140080765391
6 stars 6 forks source link

For discussion - design issue: separate annual, qtr and monthly data on import or on query FOR REVIEW #86

Closed epogrebnyak closed 8 years ago

epogrebnyak commented 8 years ago

The script now imports flat labelled rows to database and separates annual, atr and monthly data on import (add link). Queries are run more often than import, maybe it should be other way around: write labelled rows to three tables, and query each table by get_dfa(), get_dfq(), get_dfm.

Design questions:

  1. Where labelled rows should be separated - in several database tables or (as done now) on querying of a single simple database table?
  2. Where year, quarter and month should be converted to date? In database or on querying (example: https://github.com/epogrebnyak/rosstat-kep-data/blob/master/kep/query/save.py#L56)
epogrebnyak commented 8 years ago

Blocks in question

Writing parsed data to database:

Querying (somewhat convoluted):

epogrebnyak commented 8 years ago

@gabrielelanaro, @baor, @Pastafarianist , @alexanderlukanin13: need quick opinion

alexanderlukanin13 commented 8 years ago

@epogrebnyak

  1. Do you ever plan to use annual, quarter and monthly data in a single dataframe? If not, they probably should be in separate tables without the freq column.
  2. Storing year, year/quarter and year/month as integers in their respective columns is probably the cleanest design. As for conversion to DateTimeIndex - I don't see anything inherently evil in using Python function get_end_of_quarterdate and get_end_of_monthdate, as long as it doesn't noticeably affect performance
epogrebnyak commented 8 years ago

Now file at https://github.com/epogrebnyak/rosstat-kep-data/blob/master/kep/rowsystem/rowsystem.py

baor commented 8 years ago

Where labelled rows should be separated - in several database tables or (as done now) on querying of a single simple database table?

Because all labels have similar structure and in the code all labels are combined into one dataframe, storing them in one single table is the best way.

Where year, quarter and month should be converted to date?

I would recommend to move all "reshapes" to kep.database.db-module. kep.database.db should represent an adapter for database. Input for this module is link to database, output- filled pandas dataframe.

dniku commented 8 years ago

I side with @alexanderlukanin13. Annual/quarterly/monthly data is never combined, so it makes sense to get rid of freq and also of NULLs in SQLite tables. I think that's the simplest database design possible, which doesn't involve any hacks and thus it's most predictable and easy to maintain.