Open bokov opened 8 years ago
Originally I was going to go with python-generated SQL, but there are two problems that SQL solves in a simpler manner (or at least in a manner that I understand far better than how Python solves them):
order by
. I can see how one can do it to a dictionary, but we're not always working with dictionaries. I think some of these result-sets end up being lists of lists or of tuples. How do you reliably sort those?group by
. In Python? I guess map(foo, lambda: foo, bar)
? And how does that interact with the ordering? So, I guess now I'm in favor of in-memory SQL generation when order does not matter, otherwise SQL-based SQL generation but assisted by Python UDFs as necessary to make the SQL less verbose .
To-do: take a "typical" subquery, and pick apart where the irreducible atoms comprising it come from... perhaps there are intermediary columns and entire tables that only exist because I originally couldn't wrap my head around the whole process. With the new UDFs, hindsight, and better config file, we may be able to handle a more direct approach.
Too many tables depending on too many other tables. It works, but hard to understand.
Thinking along the lines of:
rules in-memory dictionary -[adds all needed rule info to]-> df_dtdict -[joins df_codeid for codeid=1 rows, and now each row describes one column in the output]-> df_cols (new) -[joins df_dtdict, chunks calculated, result-set stored in-memory and executed directly]-> tXXX -[joined to patient_dimension and each other using code generated from df_dtdict] -> df_finaloutput (renamed)
also df_cols used to generate a second, data-dictionary output file.
Total new tables: df_dtdict, df_codeid, df_cols, (tXXX,), df_finaloutput Total new views: view for output data dictionary, view for simple versoin of df_finaloutput