adsabs / AdsDataSqlSync

For non-bibliographic ADS data
GNU General Public License v3.0

review and remodel #59

Open romanchyla opened 7 years ago

romanchyla commented 7 years ago

Following the changes we have already implemented in this pipeline, here is the follow-up (written after spending considerable time thinking about it). I would like to discuss these areas:

db schema

This is actually a serious issue because all of our pipelines have to be installed automatically, so they have to provide automated upgrade/downgrade db migrations (`alembic upgrade head`). Every manual create/delete of a db schema needs to be eliminated and replaced with alembic.
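As a sketch, the manual schema management would be replaced by the standard alembic workflow (the revision message is hypothetical; this assumes a configured `alembic.ini` and `env.py` in the repo):

```shell
# Generate a migration by diffing the model metadata against the live db
alembic revision --autogenerate -m "create nonbib tables"

# Apply all pending migrations -- this is the one step automated deploys run
alembic upgrade head

# Roll back one revision if a deploy has to be reverted
alembic downgrade -1
```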

mixture of copy_from and readers

The code needs simplification: one unified interface, and that is provided by readers. If some data files are not covered by existing readers, new readers should be added so that, in the end, all `copy_from` calls can be removed entirely.
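A minimal sketch of what such a unified interface could look like (the class name and the tab-separated file format are assumptions for illustration, not the pipeline's actual reader API): every loading code path iterates the same way, regardless of the underlying file:

```python
class ColumnFileReader:
    """Minimal reader sketch: iterates an ADS Classic column file, given
    as any iterable of lines, yielding (bibcode, value) pairs."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            line = line.rstrip('\n')
            # skip blank lines and comments
            if not line or line.startswith('#'):
                continue
            bibcode, _, value = line.partition('\t')
            yield bibcode, value


# usage: any source of lines (open file, test fixture) works the same way
rows = list(ColumnFileReader(['# header', '2003ASPC..295..361M\t1']))
print(rows)  # → [('2003ASPC..295..361M', '1')]
```

Because the reader accepts any iterable of lines, tests can feed it in-memory fixtures without touching the filesystem.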

API and object encapsulation

Those are examples of many deep-reaching dependencies. Instead, the Metrics/NonBib application should instantiate the connection and hold it: use it to query for data and retrieve instances of models, and instead of passing those instances to other functions, call methods on the models. A standard pattern used across all other pipeline workers is to implement a toJSON() method that returns a dictionary (row2dict in this case). In summary, the pipeline should be revisited with an object-oriented view.
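A sketch of the pattern (a plain class stands in for a SQLAlchemy model here; the class and column names are hypothetical): the row serializes itself, so callers never reach into its internals:

```python
class NonBibRow:
    """Stand-in for a model instance; in the real pipeline this would be
    a SQLAlchemy model and _columns would come from the table metadata."""

    _columns = ('bibcode', 'boost', 'refereed')

    def __init__(self, bibcode, boost=0.0, refereed=False):
        self.bibcode = bibcode
        self.boost = boost
        self.refereed = refereed

    def toJSON(self):
        # the equivalent of row2dict: serialize the row as a plain dict
        return {name: getattr(self, name) for name in self._columns}


row = NonBibRow('2003ASPC..295..361M', boost=0.5, refereed=True)
print(row.toJSON())
```

With this in place, code that today calls row2dict(instance) would instead call instance.toJSON(), matching the other pipeline workers.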

Metrics and NonBib integration

It seems there is no need for a separate Metrics and a separate NonBib instance; the code could live well inside one application (app.ADSDataCelery). Furthermore, the presence of different schemas poses other challenges, such as having to set search paths, or non-automated deployment of the db schema (alembic). It would greatly simplify the code if the instance just stored data in separate tables (which it already does).

testing and coverage

Most of the business logic is implemented inside run.py, which is ignored by coverage. Rather than adding tests for run.py, the logic should move out of run.py into methods of the application. Also, the tests should be moved into their own separate modules.

run.py many switches and options

The business logic should actually be part of the app module (of the application), with run.py only calling the interface provided by the application.
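A sketch of that split (the command name, flag, and app class are hypothetical): run.py only parses arguments and delegates, so the logic is importable and testable without going through the CLI:

```python
import argparse


class ADSDataApp:
    """Stand-in for the application object; all business logic lives here,
    where it is importable and covered by tests."""

    def load_column_files(self, source_dir):
        return 'loading column files from %s' % source_dir


def main(argv=None):
    parser = argparse.ArgumentParser(description='nonbib pipeline CLI')
    parser.add_argument('command', choices=['load'])
    parser.add_argument('--source-dir', default='./data')
    args = parser.parse_args(argv)

    app = ADSDataApp()
    if args.command == 'load':
        # run.py does nothing but call the app interface
        return app.load_column_files(args.source_dir)


if __name__ == '__main__':
    print(main())
```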

nonbib data are called 'rowview' pipeline

It is confusing for a new developer to try to understand. 'Rowview' suggests 'a view on rows', and especially with the joins and multiple schemas, that is confusing. Everything called rowview should be renamed to nonbib (data/operations).

readers, joins and multiple schemas

I know I have asked about and pointed at this issue multiple times, but I would like to revisit it again, because looking at the code I see a lot of complexity in what could be a simple operation of reading ADS Classic source files. The end goal is to read the data and pass it to the master pipeline (discover differences and pass only the changed data), but there is a lot of 'smoke' from the schema changes, and it is not clear what the intention is, nor why it is necessary.

The presumed advantage of separate schemas/data tables is that multiple processes can read data faster (in parallel). However, all tables are created by one Python process (run.py:load_column_files), which defeats any advantage of parallel execution.

Furthermore, every entry that goes into the separate tables is keyed by its bibcode, so whether 12 workers are writing into 12 databases or into 12 columns, the data is still indexed by the bibcode. And I am not seeing any code that takes advantage of the separate tables for doing computation; all the code I see is concerned with creating a joined (global) view on the data, and that is a computation to build an 'aggregated view'.

The separate schemas should be used only (if at all) for loading data and discovering differences, not for separating metrics from nonbib data (that data is already split into separate tables).

The point being: schemas are for multi-tenancy situations, to separate data. But in essence there is no multi-tenancy here, no need to separate data, and there are no stored procedures (maybe there were in the past). So I do not see what benefits the added complexity brings; I can, however, very clearly point at its costs.

romanchyla commented 7 years ago

One small detail I missed:

some queries are stored in the config; this can backfire because the config changes. The preferred option is to specify the query through the ORM in the code, or, if that is not possible, to hardcode the query in the code

example: run.py:142
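A sketch of the preferred direction, using stdlib sqlite3 for illustration (the table and query here are hypothetical, not the actual query at run.py:142): the SQL lives next to the code that uses it and is parameterized, so a config change cannot silently alter it:

```python
import sqlite3

# The query is defined in code, not in config; values are bound at call time.
GET_REFEREED = "SELECT bibcode FROM nonbib WHERE refereed = ?"


def get_refereed(conn):
    """Return the bibcodes of all refereed records."""
    return [row[0] for row in conn.execute(GET_REFEREED, (1,))]


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nonbib (bibcode TEXT, refereed INTEGER)')
conn.executemany('INSERT INTO nonbib VALUES (?, ?)',
                 [('2003ASPC..295..361M', 1), ('2015scop.confE...3A', 0)])
print(get_refereed(conn))  # → ['2003ASPC..295..361M']
```

The ORM variant would go one step further and express the filter as model attributes, removing raw SQL strings entirely.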