Table hash rules based on CTAS

OHDSI / SqlRender

This is an R package and Java library for rendering parameterized SQL, and translating it to different SQL dialects.

https://ohdsi.github.io/SqlRender

Other

82 stars 77 forks source link

Table hash rules based on CTAS #69

Closed alondhe closed 7 years ago

alondhe commented 7 years ago

Hi @schuemie,

Looking for some guidance with regards to the intent of translateSql; is it to only translate OHDSql into viable SQL for our many dialects, or is to also optimize code for a dialect? I'm thinking particularly about these rules for Redshift: https://github.com/OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv#L232-L234

The reason I'm asking is because @ganisimov is working on testing CohortMethod and a few other packages on Redshift, and it looks like some of the queries are aliasing subject_id, person_id, or analysis_id to something else, causing an error to occur when trying to then hash on one of those 3 fields. Gennadiy has a solution for this, but I just wonder if perhaps we should leave table architecture decisions up to the author. Perhaps we add warnings to translateSql to tell the user that they may want to consider adding table hints given that they're creating a Redshift or PDW table that contains a field that we know to be optimal for hashing.

Thanks, Ajit

schuemie commented 7 years ago

Hi @alondhe,

I would say the intent of translateSql is to make life easier. If there are two viable translations for some code, then picking the most optimal one would fall into that intent.

That being said, I'm not particularly fond of the rules you refer to (which by the way are for PDW, not Redshift) since they make assumptions about the field names, and I would really like to keep SqlRender independent from the data structure for obvious reasons. With the advent of optimization hints in SqlRender they're no longer really needed, although it saves people writing hints everywhere, so I would say they're an ok default (still making life easier).

Could you give an example of how aliasing causes errors because of these rules?

ganisimov commented 7 years ago

Hi @schuemie ,

For example this query

SELECT analysis_id as col1, col2, col3 INTO #mytable FROM basetable WHERE col3 > 0

using Redshift dialect would be translated into

CREATE TABLE #mytable
DISTKEY(analysis_id) 
AS
SELECT analysis_id as col1, col2, col3 FROM basetable WHERE col3 > 0

which causes error like this

DISTKEY "analysis_id" must be included in the select list

I'm using regex variable feature to address such cases - https://github.com/ganisimov/SqlRender/blob/rs-sync/inst/csv/replacementPatterns.csv#L264-L266.

schuemie commented 7 years ago

Ah, I think I'm starting to understand. Several things are going on here:

The code in master only uses the heuristics when an explicit call is made to CREATE TABLE (without select but with column definitions in that statement). In that situation, aliases shouldn't be a problem because you can't use them in a CREATE TABLE (without select) statement.
The code you wrote for RedShift goes further, and also applies the heuristic to CREATE TABLE AS SELECT statements (SELECT INTO in the source dialect). I think I deliberately stayed out of those precisely because of the alias problem.
You seem to have fixed the alias problem in CREATE TABLE AS SELECT statements using regular expressions.

I would argue the heuristics are fine, since 9 of 10 times they do exactly what a developer of SQL code would like to happen, so they save us the trouble of adding explicit hints everywhere. @alondhe : would you agree, or do you think it is best we drop them altogether (which is cleaner)?

alondhe commented 7 years ago

Hi @schuemie -- sorry for the delay. I think @ganisimov 's solution can help resolve this issue and avoid the situation of needing so many explicit hints.

@ganisimov -- are you getting closer to making a pull request?

ganisimov commented 7 years ago

@alondhe , I'm about to make pull request, just wanted to run cohort generation one more time to double check that the latest changes don't break anything.