OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/

Incorrect SQL translation with `generateCohortSet` when using `spark` as dbms #169

Open scheah92 opened 1 week ago

scheah92 commented 1 week ago

As far as I'm aware, `generateCohortSet` uses SqlRender to translate the source query into the correct syntax, including interpolating schema and table names where necessary. I am using Databricks (i.e. `spark`) as my dbms, but I am unable to run `generateCohortSet` due to an SQL error.

First, as a reminder: when using `spark` as the dbms with DatabaseConnector, the catalog name must be specified inside the connection string, because there is no parameter yet for passing a catalog name for Databricks connections. Once the catalog name is set in the connection string, DatabaseConnector can use the supplied schema name parameters as usual.
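To make that setup concrete, here is roughly what I mean (the host, `httpPath`, and the `ConnCatalog` JDBC property are illustrative placeholders, not my actual configuration; check your Databricks driver's documentation for the exact property name):

```r
library(DatabaseConnector)

# Sketch only: the catalog has to ride along in the connection string,
# since createConnectionDetails() has no catalog parameter for spark.
connectionDetails <- createConnectionDetails(
  dbms = "spark",
  connectionString = paste0(
    "jdbc:databricks://my-workspace.cloud.databricks.com:443/default;",
    "transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/abc123;",
    "ConnCatalog=dqd_test_staging"  # catalog supplied here, not as a parameter
  ),
  user = "token",
  password = Sys.getenv("DATABRICKS_TOKEN")
)
```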

For this example, assume that

catalog <- "dqd_test_staging"
vocabulary_database_schema <- "dqd_24r3_cdm"

Take a look at the first bit of the source query for a cohort I'm creating.

CREATE TABLE #Codesets (
  codeset_id int NOT NULL,
  concept_id bigint NOT NULL
)
;

INSERT INTO #Codesets (codeset_id, concept_id)
SELECT 0 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from @vocabulary_database_schema.CONCEPT where concept_id in (1118084)
)

`generateCohortSet` first calls `SqlRender::render`, which produces the following with interpolated schema names:

CREATE TABLE #Codesets (
  codeset_id int NOT NULL,
  concept_id bigint NOT NULL
)
;

INSERT INTO #Codesets (codeset_id, concept_id)
SELECT 0 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
  select concept_id from dqd_24r3_cdm.CONCEPT where concept_id in (1118084)
)

As you can see, `@vocabulary_database_schema` was correctly interpolated, producing a valid table identifier in the `<schema>.<table>` naming convention; the `<catalog>` name is not prepended because it is already assumed from the connection string.

It then calls `SqlRender::translate` with `targetDialect = 'spark'`, which produces:

CREATE TABLE yz8flonqCodesets  
USING DELTA
 AS
SELECT
CAST(NULL AS int) AS codeset_id,
    CAST(NULL AS bigint) AS concept_id  WHERE 1 = 0;
INSERT INTO yz8flonqCodesets (codeset_id, concept_id)
SELECT 0 as codeset_id, c.concept_id FROM (select distinct I.concept_id FROM
( 
 select concept_id from dqd_24r3_cdm.CONCEPT where concept_id in (1118084)
)
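For reference, the two-step rendering above can be reproduced directly with SqlRender (a minimal sketch; the SQL fragment is abbreviated from the full cohort definition, and the random table-name prefix will differ from run to run):

```r
library(SqlRender)

sql <- "CREATE TABLE #Codesets (codeset_id int NOT NULL);
SELECT concept_id FROM @vocabulary_database_schema.CONCEPT WHERE concept_id IN (1118084);"

# Step 1: interpolate the schema name placeholder
rendered <- render(sql, vocabulary_database_schema = "dqd_24r3_cdm")

# Step 2: translate to the spark dialect; #Codesets becomes a bare
# emulated temp table name (e.g. yz8flonqCodesets) with no schema qualifier
translated <- translate(rendered, targetDialect = "spark")
cat(translated)
```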

This now produces an error: Spark tries to create the table yz8flonqCodesets using the bare <table> naming convention, so it fills in the missing <schema> name with the value default, which does not exist in my catalog. The following error message occurs:

Error running query: [SCHEMA_NOT_FOUND] 
    org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: [SCHEMA_NOT_FOUND] 
    The schema `dqd_test_staging.default` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a catalog, verify the current_schema() output, or qualify the name with the correct catalog. To tolerate the error on drop use DROP SCHEMA IF EXISTS.

From what I can see, it seems all the # placeholders like #Codesets, #qualified_events, and #final_cohort should also be prefixed with a schema placeholder, like @cohortDatabaseSchema.

Please let me know if I am somehow using this function incorrectly; otherwise, could somebody point me to a suitable workaround for spark? Should I just create a default schema for now?

anthonysena commented 1 week ago

Hi - you've got it right about the `generateCohortSet` function: under the covers it uses SqlRender as you described. Can you confirm you are using the latest version of SqlRender? I think the behavior you described will be the same, but it's worth confirming just to be sure.

I did a quick check of the query behavior using https://data.ohdsi.org/SqlDeveloper/ and noted the same query rendering. I'm not really familiar with Spark/Databricks so I'm tagging a few folks that might be able to chime in and help: @fdefalco @greshje
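One thing that may be worth trying (I haven't verified this on Databricks myself, so treat it as a sketch): SqlRender has a temp-table emulation schema setting, and if the spark dialect honors it the same way other dialects without native temp tables do, it would qualify the emulated temp tables with a schema you choose instead of falling back to default:

```r
# Assumption: the spark dialect honors tempEmulationSchema like other
# dialects that emulate temp tables. "scratch" is a hypothetical schema
# you would need to create in your catalog.
options(sqlRenderTempEmulationSchema = "scratch")

# Or pass it explicitly when translating:
translated <- SqlRender::translate(
  renderedSql,
  targetDialect = "spark",
  tempEmulationSchema = "scratch"
)
```

If I remember correctly, the HADES packages (CohortGenerator included) default their tempEmulationSchema arguments to `getOption("sqlRenderTempEmulationSchema")`, so setting the option once should flow through to `generateCohortSet`.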

scheah92 commented 1 week ago

Yes, I have SqlRender version 1.18 installed.