adsabs / AdsDataSqlSync

For non-bibliographic ADS data
GNU General Public License v3.0

review and remodel #59

Open romanchyla opened 7 years ago

romanchyla commented 7 years ago

Following the changes we have already implemented in this pipeline, here is the follow-up (written after spending considerable time thinking about it). I would like to discuss these areas:

db schema

This is actually a serious issue because all of our pipelines have to be installed automatically, so they have to provide automated upgrade/downgrade db migrations (`alembic upgrade head`). Every manual create/delete of a db schema needs to be eliminated and replaced with alembic.
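As a sketch, the manual schema management would be replaced by the standard alembic workflow (the revision message is hypothetical; this assumes a configured `alembic.ini` and `env.py` in the repo):

```shell
# Generate a migration by diffing the model metadata against the live db
alembic revision --autogenerate -m "create nonbib tables"

# Apply all pending migrations -- this is the one step automated deploys run
alembic upgrade head

# Roll back one revision if a deploy has to be reverted
alembic downgrade -1
```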

mixture of copy_from and readers

The code needs simplification: one unified interface, and that is provided by readers. If some data files are not covered by existing readers, new readers should be added so that, in the end, all `copy_from` calls can be removed entirely.
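A minimal sketch of what such a unified interface could look like (the class name and the tab-separated file format are assumptions for illustration, not the pipeline's actual reader API): every loading code path iterates the same way, regardless of the underlying file:

```python
class ColumnFileReader:
    """Minimal reader sketch: iterates an ADS Classic column file, given
    as any iterable of lines, yielding (bibcode, value) pairs."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            line = line.rstrip('\n')
            # skip blank lines and comments
            if not line or line.startswith('#'):
                continue
            bibcode, _, value = line.partition('\t')
            yield bibcode, value


# usage: any source of lines (open file, test fixture) works the same way
rows = list(ColumnFileReader(['# header', '2003ASPC..295..361M\t1']))
print(rows)  # → [('2003ASPC..295..361M', '1')]
```

Because the reader accepts any iterable of lines, tests can feed it in-memory fixtures without touching the filesystem.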

API and object encapsulation

Those are examples of many deep-reaching dependencies. Instead, the Metrics/NonBib application should instantiate the connection and hold it: use it to query for data and retrieve instances of models, and instead of passing those instances to other functions, call methods on the models. A standard pattern used across all other pipeline workers is to implement a toJSON() method that returns a dictionary (row2dict in this case). In summary, the pipeline should be revisited with an object-oriented view.
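A sketch of the pattern (a plain class stands in for a SQLAlchemy model here; the class and column names are hypothetical): the row serializes itself, so callers never reach into its internals:

```python
class NonBibRow:
    """Stand-in for a model instance; in the real pipeline this would be
    a SQLAlchemy model and _columns would come from the table metadata."""

    _columns = ('bibcode', 'boost', 'refereed')

    def __init__(self, bibcode, boost=0.0, refereed=False):
        self.bibcode = bibcode
        self.boost = boost
        self.refereed = refereed

    def toJSON(self):
        # the equivalent of row2dict: serialize the row as a plain dict
        return {name: getattr(self, name) for name in self._columns}


row = NonBibRow('2003ASPC..295..361M', boost=0.5, refereed=True)
print(row.toJSON())
```

With this in place, code that today calls row2dict(instance) would instead call instance.toJSON(), matching the other pipeline workers.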

Metrics and NonBib integration

It seems there is no need for a separate Metrics and a separate NonBib instance; the code could live well inside one application (app.ADSDataCelery). Furthermore, the presence of different schemas poses other challenges, such as having to set search paths, or non-automated deployment of the db schema (alembic). It would greatly simplify the code if the instance just stored data in separate tables (which it already does).

testing and coverage

Most of the business logic is implemented inside run.py, which is ignored by coverage. Rather than adding tests for run.py, the logic should move out of run.py into methods of the application. Also, the tests should be moved into their own separate modules.

run.py many switches and options

The business logic should actually be part of the app module (of the application), with run.py only calling the interface provided by the application.
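A sketch of that split (the command name, flag, and app class are hypothetical): run.py only parses arguments and delegates, so the logic is importable and testable without going through the CLI:

```python
import argparse


class ADSDataApp:
    """Stand-in for the application object; all business logic lives here,
    where it is importable and covered by tests."""

    def load_column_files(self, source_dir):
        return 'loading column files from %s' % source_dir


def main(argv=None):
    parser = argparse.ArgumentParser(description='nonbib pipeline CLI')
    parser.add_argument('command', choices=['load'])
    parser.add_argument('--source-dir', default='./data')
    args = parser.parse_args(argv)

    app = ADSDataApp()
    if args.command == 'load':
        # run.py does nothing but call the app interface
        return app.load_column_files(args.source_dir)


if __name__ == '__main__':
    print(main())
```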

nonbib data are called 'rowview' pipeline

It is confusing for a new developer to try to understand. 'Rowview' suggests 'a view on rows', and especially with the joins and multiple schemas, that is confusing. Everything called rowview should be renamed to nonbib (data/operations).

readers, joins and multiple schemas

I know I have asked about and pointed at this issue multiple times, but I would like to revisit it again, because looking at the code I see a lot of complexity in what could be a simple operation of reading ADS Classic source files. The end goal is to read the data and pass it to the master pipeline (discover differences and pass only the changed data), but there is a lot of 'smoke' from the schema changes, and it is not clear what the intention is, nor why it is necessary.

The presumed advantage of separate schemas/data tables is that multiple processes can read data faster (in parallel). However, all tables are created by one Python process (run.py:load_column_files), which defeats any advantage of parallel execution.

Furthermore, every entry that goes into the separate tables is keyed by its bibcode, so whether 12 workers are writing into 12 databases or into 12 columns, the data is still indexed by the bibcode. And I am not seeing any code that takes advantage of the separate tables for doing computation; all the code I see is concerned with creating a joined (global) view on the data, and that is a computation to build an 'aggregated view'.

The separate schemas should be used only (if at all) for loading data and discovering differences, not for separating metrics from nonbib data (that data is already split into separate tables).

The point being: schemas are for multi-tenancy situations, to separate data. But in essence there is no multi-tenancy here, no need to separate data, and there are no stored procedures (maybe there were in the past). So I do not see what benefits the added complexity brings; I can, however, very clearly point at its costs.

romanchyla commented 7 years ago

One small detail I missed:

some queries are stored in the config; this can backfire because the config changes. The preferred option is to specify the query through the ORM in the code, or, if that is not possible, to hardcode the query in the code

example: run.py:142
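A sketch of the preferred direction, using stdlib sqlite3 for illustration (the table and query here are hypothetical, not the actual query at run.py:142): the SQL lives next to the code that uses it and is parameterized, so a config change cannot silently alter it:

```python
import sqlite3

# The query is defined in code, not in config; values are bound at call time.
GET_REFEREED = "SELECT bibcode FROM nonbib WHERE refereed = ?"


def get_refereed(conn):
    """Return the bibcodes of all refereed records."""
    return [row[0] for row in conn.execute(GET_REFEREED, (1,))]


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nonbib (bibcode TEXT, refereed INTEGER)')
conn.executemany('INSERT INTO nonbib VALUES (?, ?)',
                 [('2003ASPC..295..361M', 1), ('2015scop.confE...3A', 0)])
print(get_refereed(conn))  # → ['2003ASPC..295..361M']
```

The ORM variant would go one step further and express the filter as model attributes, removing raw SQL strings entirely.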