Pass through dbms value as an @parameter so it is available within ohdisql code

leeevans commented 7 years ago

This is an enhancement request.

I'd like to be able to implement the following type of logic in ohdisql in order to include DBMS specific SQL code.

{@dbms=='<dbms name>'}?{
   -- <dbms name> specific code
}:{
   -- other dbms code
}

My use case is to add a single line of Impala specific code in Achilles ohdisql but I see this standardized way to access the dbms value in ohdisql code as a generally useful enhancement.

I believe the change to implement this would be to add dbms as the first parameter passed to renderSql() in the loadRenderTranslateSql() function in HelperFunctions.R and this would be a backward compatible change.

renderedSql <- renderSql(parameterizedSql[1], dbms, ...)$sql

See current code below for reference.

loadRenderTranslateSql <- function(sqlFilename,
                                   packageName,
                                   dbms = "sql server",
                                   ...,
                                   oracleTempSchema = NULL) {
  pathToSql <- system.file(paste("sql/", gsub(" ", "_", dbms), sep = ""),
                           sqlFilename,
                           package = packageName)
  mustTranslate <- !file.exists(pathToSql)
  if (mustTranslate) {
    # If DBMS-specific code does not exists, load SQL Server code and translate after rendering
    pathToSql <- system.file(paste("sql/", "sql_server", sep = ""),
                             sqlFilename,
                             package = packageName)
  }
  parameterizedSql <- readChar(pathToSql, file.info(pathToSql)$size)

  renderedSql <- renderSql(parameterizedSql[1], ...)$sql

  if (mustTranslate)
    renderedSql <- translateSql(sql = renderedSql, targetDialect = dbms, oracleTempSchema = oracleTempSchema)$sql

  renderedSql
}

schuemie commented 7 years ago

Although I understand there might be situations where platform-specific code is unavoidable, I think it should be a last resort. Could you provide the specific code that needs to be platform specific? Perhaps we can find a way to generalize it to a rule that we can add to the translation, maybe using a translation hint.

leeevans commented 7 years ago

Yes I agree, it should be a last resort. If it is required then it would be preferable to have a standard way of checking the dbms value and avoid proliferating different dbms specific flags or other workarounds.

Using a SqlRender rule would be the best approach and is recommended whenever possible.

Below is the Achilles Impala DBMS specific use case I'm working on. It would be great if there is a simple rule I could use:

Impala errors out if inserting a text literal (which are of type STRING in Impala) into a varchar(n) column. The error message is to 'avoid potential loss of precision'. An Impala STRING is essentially an unlimited length varchar.

The achilles_analysis table columns are varchar(255). The STRING text literals inserted are never larger than 255 chars but Impala still errors out.

The Achilles inserts into the achilles_analysis are failing on Impala based on the varchar(255) data types used in the achilles_analysis table DDL coded in the Achilles ohdisql (which work fine for all other DBMSs).

To correct the issue for Impala I'm looking to update the Achilles achilles_analysis create table sql code to use STRING based columns instead of varchar(255) for Impala only. So my proposal would be the following code:

{@dbms=='impala'}?{
create table @results_database_schema.ACHILLES_analysis
(
        analysis_id int,
        analysis_name STRING,
        stratum_1_name STRING,
        stratum_2_name STRING,
        stratum_3_name STRING,
        stratum_4_name STRING,
        stratum_5_name STRING
);
--:{
create table @results_database_schema.ACHILLES_analysis
 (
        analysis_id int,
        analysis_name varchar(255),
        stratum_1_name varchar(255),
        stratum_2_name varchar(255),
        stratum_3_name varchar(255),
        stratum_4_name varchar(255),
        stratum_5_name varchar(255)
);
--}

For another example of DBMS specific code, see the below link to an earlier update to Achilles.R and the achilles ohdisql which passes a PDW DBMS specific flag parameter "is_pdw" into the loadRenderTranslateSql() Achilles function

https://github.com/OHDSI/Achilles/commit/5e908fcc7ab58abfb244ebd2d4f3d614876a5192

So another option for my Impala specific DBMS use case would be to pass in another flag called "is_impala" into loadRenderTranslate() and check that. However that seems to proliferate dbms specific flags and the dbms value is already separately passed into that R function, it just isn't passed through as a @dbms variable accessible in the ohdisql.

schuemie commented 7 years ago

Ehh, what if we were to add a rule stating VARCHAR(@a) should always be translated to STRING for Impala? @tomwhite, do you see any problems with that?

For the PDW-specific code in Achilles: this seems like something that could easily be handled using a hint when creating the achilles_results table.

leeevans commented 7 years ago

@schuemie thanks for the helpful suggestions.

I wasn't aware of the ability to add hints - that's very nice. I see they are already implemented in the rules for Redshift as well as PDW!

https://github.com/OHDSI/SqlRender/blob/master/vignettes/UsingSqlRender.Rmd#optimization-for-massively-parallel-processing

Referencing @alondhe here regarding the current PDW specific code in Achilles vs switching to using hints which will also speed up Achilles on other DBMSs (e.g. Redshift). I don't think hints were available when the original code was added.

I'll await @tomwhite 's response on the impact of changing all varchar(@a)to STRING in Impala.

Tom, to add some context to this change: the downside I found for the suggested approach of setting createTable = FALSE for running Achilles on Impala and manually creating the Achilles tables is that the achilles_analysis table is not populated. The inserts into achilles_analysis are within the createTable block of Achilles sql code.

tomwhite commented 7 years ago

Translating VARCHAR to STRING would probably work fine.

gklebanov commented 7 years ago

I do have some concerns about STRING vs VARCHAR compatibility.

Reading this first and it sounds great: _"in Impala 2.0 and higher you can declare columns as VARCHAR(maxlength) or CHAR(length), but for best performance use STRING where practical."

But then finding the following:

Varchar is twice the size of String (65,535 vs. 32,767)
There are seems to be issues with STRING support for non-ASCII data. Here is an extract from Coudera manual: _For full support in all Impala subsystems, restrict string values to the ASCII character set. Although some UTF-8 character data can be stored in Impala and retrieved through queries, UTF-8 strings containing non-ASCII characters are not guaranteed to work properly in combination with many SQL aspects, including but not limited to:

String manipulation functions. Comparison operators. The ORDER BY clause. Values in partition key columns._

Unless I am misreading it, but these are quite big limitations especially considering that most of the data is stored UTF these days https://www.cloudera.com/documentation/enterprise/5-4-x/topics/impala_string.html#string

VARCHAR can be used to partition

Well, to be hones - in this particular example - Achilles - these are probably not a big constraint.

Personally, I would also go back to the original issue and understand the reasons of why it is happening:

_"Impala errors out if inserting a text literal (which are of type STRING in Impala) into a varchar(n) column. The error message is to 'avoid potential loss of precision'. An Impala STRING is essentially an unlimited length varchar.

The achillesanalysis table columns are varchar(255). The STRING text literals inserted are never larger than 255 chars but Impala still errors out."

@leeevans - Lee, is there any chance you could share that specific code? Do you explicitly cast?

tomwhite commented 7 years ago

I think that VARCHAR has the same limitations, except for partition key columns, but is that needed in OHDSI? Also, do you know if the size limitation is a real problem for Achilles or other OHDSI projects?

The other reason to map VARCHAR to STRING is that Kudu doesn't support VARCHAR, so this change would mean we can use either Parquet or Kudo for storage, which is something we are very keen to get working.

gklebanov commented 7 years ago

Reading Cloudera manual, it does not seem that VARCHAR suffers from limitations that I mentioned:

"All data in CHAR and VARCHAR columns must be in a character encoding that is compatible with UTF-8. If you have binary data from another database system (that is, a BLOB type), use a STRING column to hold it."

and all string functions and "order by" seems to be working. UTF support is definitely important.

"is that needed in OHDSI? Also, do you know if the size limitation is a real problem for Achilles or other OHDSI projects?"

The size limitation is not an issue here. I cannot say with full confidence that the rest of limitations would be a showstopper for Achilles but - as far as a broader OHDSI use case - I would say this probably would be a problem especially if we consider having any CDM data set exposed via Impala with STRING type only.

I agree - Parquet format should be used for OMOP CDM storage as a best practice. And we should continue to re-iterate it.

gklebanov commented 7 years ago

The other reason to map VARCHAR to STRING is that Kudu doesn't support VARCHAR, so this change would mean we can use either Parquet or Kudo for storage, which is something we are very keen to get working.

Personally, I would recommend that we should treat Kudu as a separate undertaking - this will be an evolution. The project is still in "incubation" mode (meaning subject to changes) while many of existing users are already familiar and use Impala with Parquet.

leeevans commented 7 years ago

Thanks @tomwhite and @gklebanov for the info.

Greg, The error is in the achilles_analysis table inserts in the existing Achilles code. Those inserts have a literal text string containing the analysis description.

Tom, the lengths of the text literals in the achilles_analysis table inserts aren't a real problem for any DBMS. Would there be an issue with using Impala partition key columns if we replace VARCHAR with STRING? I believe we will need to use that feature in OHDSI given the large size of the datasets.

Another alternative would be to just update Achilles to explicitly cast all the text literals in the achilles_analysis table inserts to varchar(255) which should work for all DBMSs. That's a simple change.

gklebanov commented 7 years ago

Another alternative would be to just update Achilles to explicitly cast all the text literals in the achilles_analysis table inserts to varchar(255) which should work for all DBMSs. That's a simple change.

That sounds like a good low impact step to me. Casting is always a good practice anyway

leeevans commented 7 years ago

Yes, I'm going to take the casting approach in Achilles and close this issue.

OHDSI / SqlRender

Pass through dbms value as an @parameter so it is available within ohdisql code #82