UTHSCSA-CIRD / datafinisher_backup

0 stars 2 forks source link

Making the internals of DF less convoluted #2

Open bokov opened 8 years ago

bokov commented 8 years ago

Too many tables depending on too many other tables. It works, but hard to understand.

Thinking along the lines of:

rules in-memory dictionary -[adds all needed rule info to]-> df_dtdict -[joins df_codeid for codeid=1 rows, and now each row describes one column in the output]-> df_cols (new) -[joins df_dtdict, chunks calculated, result-set stored in-memory and executed directly]-> tXXX -[joined to patient_dimension and each other using code generated from df_dtdict] -> df_finaloutput (renamed)

also df_cols used to generate a second, data-dictionary output file.

Total new tables: df_dtdict, df_codeid, df_cols, (tXXX,), df_finaloutput Total new views: view for output data dictionary, view for simple versoin of df_finaloutput

bokov commented 8 years ago

Originally I was going to go with python-generated SQL, but there are two problems that SQL solves in a simpler manner (or at least in a manner that I understand far better than how Python solves them):

  1. Order matters (sadly). In SQL it's trivial to insure thematically related rows are next to each other using order by. I can see how one can do it to a dictionary, but we're not always working with dictionaries. I think some of these result-sets end up being lists of lists or of tuples. How do you reliably sort those?
  2. Grouping matters. Again, in SQL your one-stop shop is group by. In Python? I guess map(foo, lambda: foo, bar) ? And how does that interact with the ordering?

So, I guess now I'm in favor of in-memory SQL generation when order does not matter, otherwise SQL-based SQL generation but assisted by Python UDFs as necessary to make the SQL less verbose .

bokov commented 8 years ago

To-do: take a "typical" subquery, and pick apart where the irreducible atoms comprising it come from... perhaps there are intermediary columns and entire tables that only exist because I originally couldn't wrap my head around the whole process. With the new UDFs, hindsight, and better config file, we may be able to handle a more direct approach.