epogrebnyak / data-rosstat-kep

Time series dataset of Rosstat Short-term Economic Indicators ("KEP") publication
http://www.gks.ru/wps/wcm/connect/rosstat_main/rosstat/ru/statistics/publications/catalog/doc_1140080765391

For review - new program structure layout (test-driven+'rowsystem' data structure) #87

Closed epogrebnyak closed 8 years ago

epogrebnyak commented 8 years ago

I made a prototype of the parser for this program that uses a new data structure to represent csv file contents as a list of dicts holding the original data, the modified data and extra info used in parsing. My hope is that it makes csv parsing more traceable and explicit: the 'rowsystem' stores the state of parsing at different stages, whereas in the previously used generators/lists the parsing data lived only once.
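To make the idea concrete, here is a minimal sketch of such a list of dicts. The key names (`string`, `list`, `head`, `header_label`, `unit_label`), the tab delimiter and the sample rows are assumptions for illustration and may differ from what rowsystem.py actually uses.

```python
import csv
import io

# Hypothetical sample input; real KEP csv files are larger and messier.
CSV_TEXT = "Industrial output, bln rub\n2014\t100.5\n2015\t103.2\n"


def to_rowsystem(csv_text):
    """Represent each csv row as a dict holding its raw and parsed state."""
    reader = csv.reader(io.StringIO(csv_text), delimiter="\t")
    rowsystem = []
    for cells in reader:
        rowsystem.append({
            "string": "\t".join(cells),            # original row, kept for traceability
            "list": cells,                         # row split into cells
            "head": cells[0] if cells else None,   # first cell: a year or a text header
            "header_label": None,                  # to be filled in by labelling
            "unit_label": None,                    # to be filled in by labelling
        })
    return rowsystem


rs = to_rowsystem(CSV_TEXT)
```

Because every stage writes into the same dicts instead of consuming a generator, any row can be inspected after the fact to see both what was read and what was derived from it.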

My request is to review the prototype in https://github.com/epogrebnyak/rosstat-kep-data/blob/master/rowsystem.py and write some impressions/comments before I transfer more code in there.

epogrebnyak commented 8 years ago

@gabrielelanaro, @baor, @Pastafarianist, @alexanderlukanin13: can you have a look at this and write a brief comment here in the issue? Perhaps something that is not clear from the file?

gabrielelanaro commented 8 years ago

The way you split the code is reasonable. I would encapsulate this rowsystem in a class that keeps its state, so that you can use it like this:

>>> rs = RowSystem(csv_input)  # Initialize and parse the dataframe
>>> rs.dataframe  # Retrieve the underlying pandas dataframe
>>> rs.label(dicts)  # This will update/replace rs.dataframe with the labeled version of it
>>> rs.annual()  # This will update/replace rs.dataframe with the annual version of it

You could implement it in this way:

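class RowSystem:

    def __init__(self, doc):
        # Parse doc here; the Ellipsis is a placeholder for the parsed dataframe
        self.dataframe = ...

    def label(self, dicts):
        # Process self.dataframe using dicts, then replace it with the labeled version
        self.dataframe = ...

    def annual(self):
        # Same pattern as label: replace self.dataframe with the annual version
        self.dataframe = ...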

In this way you can also add more methods for different processing options, or you can store more data/configurations as attributes if needed.

epogrebnyak commented 8 years ago

@gabrielelanaro: Indeed, this may be a class, thanks for pointing this out, and for the code too. It should be very intuitive on the user end. I think I will keep the methods as functions until they are stable, though, and encapsulate them in a class later.

baor commented 8 years ago

I suggest you store all test data explicitly; this will highlight any redundancy: https://github.com/epogrebnyak/rosstat-kep-data/blob/941f0131f2e1ad16b6e5be678cce2cbd0039a843/kep/rowsystem/rowsystem.py#L123-L137

Personally, I don't understand the point of the keys listed below, because their values are absent from the csv file: 'head', 'header_label', 'unit_label', 'dicts'.

epogrebnyak commented 8 years ago

@baor: The values are not in the csv file because they are the result of the labelling procedure: we take the csv file plus the dicts and obtain 'header_label' and 'unit_label' for each row.
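A rough sketch of that labelling step, under assumptions: the mapping dicts, the label names and the substring-matching rule below are hypothetical, and the real procedure may match headers differently. A text row that matches a known header or unit sets the current labels, and every following row is stamped with them.

```python
def label_rows(rows, header_dict, unit_dict):
    """Return copies of the row dicts with 'header_label' and 'unit_label' set."""
    current_header, current_unit = None, None
    labelled = []
    for row in rows:
        head = row["head"]
        # A row whose first cell contains a known header text starts a new variable.
        for text, label in header_dict.items():
            if text in head:
                current_header = label
        # The same row (or a later one) may also announce the unit of measurement.
        for text, unit in unit_dict.items():
            if text in head:
                current_unit = unit
        # Copy the row so the unlabelled state stays available for inspection.
        labelled.append(dict(row, header_label=current_header, unit_label=current_unit))
    return labelled


rows = [
    {"head": "Industrial output, bln rub"},
    {"head": "2014"},
    {"head": "2015"},
]
headers = {"Industrial output": "IND_PROD"}  # hypothetical mapping
units = {"bln rub": "bln_rub"}               # hypothetical mapping
labelled = label_rows(rows, headers, units)
```

The data rows "2014" and "2015" carry no header text themselves, which is why their labels cannot be found in the csv file and have to be derived.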

As for the explicit test data: is https://github.com/epogrebnyak/rosstat-kep-data/blob/master/kep/rowsystem/rowsystem.py#L140-L204 better now?

epogrebnyak commented 8 years ago

For my notes:

rs = RowSystem(csv_input) # read raw csv into class instance
rs.label(dicts) # add labels to csv rows based on dicts
rs.label(dicts, segments) # add labels to csv rows based on core dicts and segments information
dfa = rs.dfa() # get annual dataframe from labelled rows
dfq = rs.dfq() # get quarterly dataframe from labelled rows
dfm = rs.dfm() # get monthly dataframe from labelled rows
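
One possible shape for the step behind dfa/dfq/dfm, as a sketch only. It assumes each data row holds a year followed by one annual, four quarterly and twelve monthly values, and it returns plain lists of tuples rather than pandas dataframes; both the layout and the function name are assumptions for illustration.

```python
def split_by_frequency(labelled_rows):
    """Split labelled data rows into annual, quarterly and monthly records."""
    annual, quarterly, monthly = [], [], []
    for row in labelled_rows:
        cells = row["list"]
        if not cells or not cells[0].isdigit():
            continue  # text rows carry labels only, no data
        year = int(cells[0])
        name = row["header_label"] + "_" + row["unit_label"]
        values = [float(x) for x in cells[1:]]
        if values:
            annual.append((name, year, values[0]))
        # Assumed layout: values[1:5] are quarters, values[5:17] are months.
        for quarter, value in enumerate(values[1:5], start=1):
            quarterly.append((name, year, quarter, value))
        for month, value in enumerate(values[5:17], start=1):
            monthly.append((name, year, month, value))
    return annual, quarterly, monthly


sample = [
    {"list": ["Industrial output, bln rub"], "header_label": "IND_PROD", "unit_label": "bln_rub"},
    {"list": ["2014", "100.0"] + ["25.0"] * 4 + ["8.0"] * 12,
     "header_label": "IND_PROD", "unit_label": "bln_rub"},
]
annual, quarterly, monthly = split_by_frequency(sample)
```

Each of the three lists could then be turned into a dataframe in a separate step.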

dniku commented 8 years ago

Well, the description in rowsystem.py basically describes explicitly what is already implemented, with the exception that these RS_FROM_FILE-style dicts are not actually passed around.

I don't like the fragility of this approach: it rests on many assumptions. We should probably add plenty of checks that whatever we're parsing actually conforms to these assumptions, and throw an exception if it doesn't.
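A sketch of what such a check could look like for a single data row. The exception class, the function name and the specific rules (a four-digit year, a fixed set of allowed row lengths, numeric cells with either a dot or a comma as the decimal mark) are hypothetical examples of assumptions, not the parser's actual rules.

```python
class RowFormatError(ValueError):
    """Raised when a csv row does not match the parser's assumptions."""


# Hypothetical allowed lengths: year + annual, + 4 quarters, + 12 months.
EXPECTED_LENGTHS = {2, 6, 18}


def check_data_row(cells, line_number):
    """Return True if the row looks like a data row, else raise RowFormatError."""
    if not cells:
        raise RowFormatError(f"line {line_number}: empty row")
    year_text = cells[0]
    if not (year_text.isdigit() and len(year_text) == 4):
        raise RowFormatError(f"line {line_number}: bad year {year_text!r}")
    if len(cells) not in EXPECTED_LENGTHS:
        raise RowFormatError(f"line {line_number}: unexpected length {len(cells)}")
    for cell in cells[1:]:
        if cell == "":
            continue  # missing values are allowed in this sketch
        try:
            float(cell.replace(",", "."))
        except ValueError:
            raise RowFormatError(f"line {line_number}: not a number: {cell!r}") from None
    return True
```

Failing loudly with the line number makes a change in the source file format show up as an error at parse time rather than as silently wrong data downstream.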

epogrebnyak commented 8 years ago

Based on the discussion above, the program has the following class structure:

from kep.reader.rs import CurrentMonthRowSystem
from kep.extract.publish import Publisher
from kep.extract.dataframes import KEP

# read raw csv datafile and save data to sqlite and clean CSV dumps 
CurrentMonthRowSystem().update()

# visualise data in 'output' directory - write Excel, CSV, PDF and png files
Publisher().publish()

# access available data as pandas dataframes at annual, quarterly and monthly frequencies
dfa, dfq, dfm = KEP().dfs()

More comments here.