Closed by alondhe 7 years ago
Hi @alondhe,
I would say the intent of translateSql is to make life easier. If there are two viable translations for some code, then picking the more efficient one would fall within that intent.
That being said, I'm not particularly fond of the rules you refer to (which, by the way, are for PDW, not Redshift) since they make assumptions about the field names, and I would really like to keep SqlRender independent of the data structure for obvious reasons. With the advent of optimization hints in SqlRender they're no longer really needed, although they save people from writing hints everywhere, so I would say they're an OK default (still making life easier).
Could you give an example of how aliasing causes errors because of these rules?
Hi @schuemie,
For example, this query
SELECT analysis_id as col1, col2, col3 INTO #mytable FROM basetable WHERE col3 > 0
using the Redshift dialect would be translated into
CREATE TABLE #mytable
DISTKEY(analysis_id)
AS
SELECT analysis_id as col1, col2, col3 FROM basetable WHERE col3 > 0
which causes an error like this
DISTKEY "analysis_id" must be included in the select list
I'm using the regex variable feature to address such cases: https://github.com/ganisimov/SqlRender/blob/rs-sync/inst/csv/replacementPatterns.csv#L264-L266.
Ah, I think I'm starting to understand. Several things are going on here:
The code in master only uses the heuristics when an explicit call is made to CREATE TABLE (without select but with column definitions in that statement). In that situation, aliases shouldn't be a problem because you can't use them in a CREATE TABLE (without select) statement.
The code you wrote for Redshift goes further, and also applies the heuristic to CREATE TABLE AS SELECT statements (SELECT INTO in the source dialect). I think I deliberately stayed out of those precisely because of the alias problem.
You seem to have fixed the alias problem in CREATE TABLE AS SELECT statements using regular expressions.
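To make the alias problem concrete, here is a minimal Python sketch of an alias-aware key choice for CREATE TABLE AS SELECT. It is not the regex-variable patterns from the linked branch and not SqlRender code; the function names `output_columns` and `pick_distkey` are hypothetical, and the parsing assumes a simple comma-separated select list with no commas inside expressions and an explicit AS for every alias.

```python
import re

# Column names the heuristic treats as good distribution keys (from this thread).
CANDIDATE_KEYS = ("subject_id", "person_id", "analysis_id")


def output_columns(select_list):
    """Return the output column names of a simple select list.

    Hypothetical helper: assumes no commas inside expressions and that
    every alias is introduced with an explicit AS keyword.
    """
    names = []
    for item in select_list.split(","):
        item = item.strip()
        # "expr AS alias" -> the output column is named after the alias.
        match = re.search(r"\s+as\s+(\w+)$", item, flags=re.IGNORECASE)
        if match:
            names.append(match.group(1))
        else:
            # Drop a table qualifier such as "t." from "t.person_id".
            names.append(item.split(".")[-1])
    return names


def pick_distkey(select_list):
    """Return a candidate key only if it survives as an output column name."""
    outputs = [name.lower() for name in output_columns(select_list)]
    for key in CANDIDATE_KEYS:
        if key in outputs:
            return key
    return None  # No safe key: emit the statement without DISTKEY.
```

With the select list from the example above, `pick_distkey("analysis_id as col1, col2, col3")` returns `None`, so no DISTKEY clause would be generated and the "must be included in the select list" error is avoided. An unaliased list such as `"person_id, col2"` still yields `person_id`.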
I would argue the heuristics are fine, since 9 out of 10 times they do exactly what a developer of SQL code would like to happen, so they save us the trouble of adding explicit hints everywhere. @alondhe: would you agree, or do you think it is best that we drop them altogether (which is cleaner)?
Hi @schuemie -- sorry for the delay. I think @ganisimov's solution can help resolve this issue and avoid the situation of needing so many explicit hints.
@ganisimov -- are you getting closer to making a pull request?
@alondhe, I'm about to make a pull request; I just wanted to run cohort generation one more time to double-check that the latest changes don't break anything.
Hi @schuemie,
Looking for some guidance with regard to the intent of translateSql: is it only to translate OHDSql into viable SQL for our many dialects, or is it also to optimize code for a dialect? I'm thinking particularly about these rules for Redshift: https://github.com/OHDSI/SqlRender/blob/master/inst/csv/replacementPatterns.csv#L232-L234
The reason I'm asking is because @ganisimov is working on testing CohortMethod and a few other packages on Redshift, and it looks like some of the queries are aliasing subject_id, person_id, or analysis_id to something else, causing an error to occur when trying to then hash on one of those 3 fields. Gennadiy has a solution for this, but I just wonder if perhaps we should leave table architecture decisions up to the author. Perhaps we add warnings to translateSql to tell the user that they may want to consider adding table hints given that they're creating a Redshift or PDW table that contains a field that we know to be optimal for hashing.
Thanks, Ajit
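The warning idea could look roughly like the Python sketch below. It is an illustration only, not part of translateSql: `distribution_warnings` and `warn_about_distribution` are hypothetical names, statements are naively split on semicolons, and the check assumes hints are written as a `--HINT DISTRIBUTE_ON_KEY(...)` comment directly before the statement.

```python
import re
import warnings

# Fields named in this thread as good hashing candidates.
CANDIDATE_KEYS = ("subject_id", "person_id", "analysis_id")
# Dialects where a distribution key matters for this discussion.
DISTRIBUTED_DIALECTS = ("redshift", "pdw")

# Matches CREATE TABLE, or SELECT ... INTO <table> (but not INSERT INTO ... SELECT).
CREATE_PATTERN = re.compile(
    r"\bcreate\s+table\b|\bselect\b.*?\binto\s+[#\w.]+",
    re.IGNORECASE | re.DOTALL,
)
# Assumed hint form: a comment such as --HINT DISTRIBUTE_ON_KEY(person_id).
HINT_PATTERN = re.compile(r"--\s*HINT\s+DISTRIBUTE_ON_KEY\s*\(", re.IGNORECASE)


def distribution_warnings(sql, target_dialect):
    """Return one message per unhinted table-creating statement that
    mentions a candidate key, for dialects that distribute tables."""
    if target_dialect.lower() not in DISTRIBUTED_DIALECTS:
        return []
    messages = []
    for statement in sql.split(";"):  # Naive split; ignores ';' in literals.
        if not CREATE_PATTERN.search(statement):
            continue
        if HINT_PATTERN.search(statement):
            continue  # The author already made an explicit choice.
        for key in CANDIDATE_KEYS:
            if re.search(r"\b" + key + r"\b", statement, re.IGNORECASE):
                messages.append(
                    f"Statement references {key}; consider adding a "
                    f"distribution hint for {target_dialect}."
                )
                break
    return messages


def warn_about_distribution(sql, target_dialect):
    """Emit the messages as Python warnings instead of changing the SQL."""
    for message in distribution_warnings(sql, target_dialect):
        warnings.warn(message)
```

The design choice here is that the translation output is never altered: the author keeps control of table architecture and only gets a nudge when a likely hashing field appears without a hint.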