OHDSI / WebAPI

OHDSI WebAPI contains all OHDSI services that can be called from OHDSI applications
Apache License 2.0

Atlas Characterizations, Cohort Pathways & Cohort Reports not working on Spark (Databricks) #2248

Closed TomWhite-MedStar closed 1 year ago

TomWhite-MedStar commented 1 year ago

Expected behavior

Three Atlas functions that work fine on SQL Server are not working on Spark: Characterizations, Cohort Pathways, and Cohort Reports (Heracles reports). Characterizations worked fine in Spark/Databricks in the 2.11 release of Atlas and WebAPI, but do not work in 2.12.1.

Actual behavior

Each of those analyses fails, usually after significant runtime. In each case the root cause appears to be a SqlRender substitution error. See issue #330 in SqlRender for more detail. There is also discussion on this forum post.

Steps to reproduce behavior

Characterizations

Load the attached json into Characterizations. Select Executions and click Generate: Lipitor Users vs Zocor Users.json.txt

There are multiple queries like show columns in #analysis_ref where the temporary table names are not substituted to something like show columns in tmp.analysis_ref. However, those queries do not cause the job to fail. Instead, it fails the first time there is an unprocessed substitution. The error file is below: Lipitor Users vs Zocor Users.error.txt
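For reference, here is a minimal sketch of the expected substitution (the scratch schema, and any session prefix SqlRender adds to the table name, are illustrative):

-- OHDSI SQL source, referencing a temp table:
show columns in #analysis_ref;

-- Expected Spark translation, with the temp table emulated as a real
-- table in a scratch schema:
show columns in tmp.analysis_ref;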

Cohort Pathways

Load the attached json into Cohort Pathways. Select Executions, and click Generate: Diabetic Treatment Pathway.json.txt

The error message generated is attached: Diabetic Treatment Pathway.error.txt

Cohort / Heracles Reports

Load the attached json cohort and generate the cohort. Then, under Cohort Reporting, select that source and choose Quick Analysis: [COVID HCQ ID62 v1] New users of sulfasazine with prior rheumatoid arthritis.json.txt

The Heracles report fails after a few seconds. The SQL query that caused the error is attached: Cohort-Heracles.error.txt

anthonysena commented 1 year ago

@TomWhite-MedStar - I'm going to close this for now so we can track this work on the SqlRender issue. If needed, we can transfer the SqlRender issue to this repo and re-visit the information you have provided in this issue (which is excellent by the way!).

TomWhite-MedStar commented 1 year ago

@anthonysena, we upgraded to Atlas 2.13, and I can confirm that all of the above-listed issues are now fixed.

However, when I ran the Heracles Full Analysis it failed partway through. Here is the generated Spark SQL that failed: Cohort-Heracles-full.sql.error.txt

The error message is:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'into'.(line 8, pos 0)
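For reference, Spark SQL does not support the SELECT ... INTO form; on Databricks the equivalent is a CREATE TABLE ... AS SELECT. A minimal sketch (table and schema names are illustrative):

-- Fails on Spark with [PARSE_SYNTAX_ERROR] at 'into':
select count(*) as count_value into tmp.results_200 from omop.visit_occurrence;

-- Equivalent Databricks form that the translation should produce:
create table tmp.results_200 using delta as
select count(*) as count_value from omop.visit_occurrence;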

Should I continue to track such issues here, or only in the SqlRender thread?

anthonysena commented 1 year ago

Excellent - let's track this work here. I'll close out the SqlRender issue that is now linked. Let me see if I can track down the query causing the issue and go from there.

anthonysena commented 1 year ago

@TomWhite-MedStar you mentioned the error message is:

[PARSE_SYNTAX_ERROR] Syntax error at or near 'into'.(line 8, pos 0)

However, I see no mention of an 'into' statement in the SQL you provided:

select
  c1.cohort_definition_id,
  200 as analysis_id,
  --
  --
  vo1.visit_CONCEPT_ID as stratum_1,
  --
  cast('' as STRING) as stratum_2,
  cast('' as STRING) as stratum_3,
  cast('' as STRING) as stratum_4,
  COUNT(distinct vo1.person_id) as count_value into tmp_v0125_v2.sl527h4fresults_200
from
  omop_160101_to_221231_v0125.visit_occurrence vo1
  inner join tmp_v0125_v2.sl527h4fHERACLES_cohort c1 on vo1.person_id = c1.subject_id --
group by
  c1.cohort_definition_id,
  --
  --
  vo1.visit_CONCEPT_ID --;
  CREATE TABLE tmp_v0125_v2.sl527h4fresults_500 USING DELTA AS
SELECT
  c1.cohort_definition_id,
  500 as analysis_id,
  d1.cause_CONCEPT_ID as stratum_1,
  cast('' as STRING) as stratum_2,
  cast('' as STRING) as stratum_3,
  cast('' as STRING) as stratum_4,
  COUNT(distinct d1.PERSON_ID) as count_value
FROM
  omop_160101_to_221231_v0125.death d1
  inner join tmp_v0125_v2.sl527h4fHERACLES_cohort c1 on d1.person_id = c1.subject_id --
group by
  c1.cohort_definition_id,
  d1.cause_CONCEPT_ID

Is this SQL from the WebAPI log itself? I think I may be missing some of the SQL that may have caused the issue here.

For reference, I believe this is the SQL script that was used to generate the 1st part of the SQL above: https://github.com/OHDSI/WebAPI/blob/a1a130a9606c0318ca3b69255b1aee50b39ab56d/src/main/resources/resources/cohortanalysis/heraclesanalyses/sql/200_201.sql

TomWhite-MedStar commented 1 year ago

@anthonysena, the INTO statement is hidden in the formatting on the line before the first FROM statement:

COUNT(distinct vo1.person_id) as count_value **into** tmp_v0125_v2.sl527h4fresults_200

When I run the code you linked above through SqlRender, I get a different output from the above.

Might it be a SQL splitting issue? I see from the line vo1.visit_CONCEPT_ID --; above that there is a semicolon after a comment marker. The full SQL I posted above is what was sent to Databricks, so it appears that SqlRender didn't recognize that there are two SQL statements in it.

Which code does the substitutions in sections like this?

--{@CDM_version == '4'}?{
vo1.place_of_service_CONCEPT_ID
--}
--{@CDM_version == '5'}?{
vo1.visit_CONCEPT_ID
--}
;

Could that be leading to cases like the commented-out semicolon shown above?
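If I understand SqlRender's if-then syntax correctly, rendering that block with @CDM_version = '5' should drop the unmatched branch and leave the -- markers behind as bare comment lines, something like:

--
--
vo1.visit_CONCEPT_ID
--
;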

anthonysena commented 1 year ago

@anthonysena, the INTO statement is hidden in the formatting on the line before the first FROM statement:

COUNT(distinct vo1.person_id) as count_value into tmp_v0125_v2.sl527h4fresults_200

In the words of Homer Simpson: d'oh! Sorry I missed that.

There does appear to be some type of rendering issue based on what I am seeing in SqlDeveloper. Here is what the SQL mentioned above looks like when translated to Spark:

[screenshot: the SQL translated to Spark, as shown in SqlDeveloper]

For reference, here is the rendered Spark SQL:

select c1.cohort_definition_id, 200 as analysis_id,
 --
 --
 vo1.visit_CONCEPT_ID as stratum_1,
 --
 cast( '' as STRING ) as stratum_2, cast( '' as STRING ) as stratum_3, cast( '' as STRING ) as stratum_4,
 COUNT(distinct vo1.person_id) as count_value
into temp_em.k6raaf7fresults_200
from
cdm.visit_occurrence vo1
inner join temp_em.k6raaf7fHERACLES_cohort c1
on vo1.person_id = c1.subject_id
--
WHERE vo1.visit_start_date>=c1.cohort_start_date and vo1.visit_end_date<=c1.cohort_end_date
--
group by c1.cohort_definition_id,
--
--
vo1.visit_CONCEPT_ID
--;

As you noted above, the trailing semi-colon is commented out, so Spark is likely treating that entire SQL block as one statement, thus causing the error. This appears to be unique to the Spark translation; compare the same query translated to PostgreSQL:

[screenshot: the SQL translated to PostgreSQL, as shown in SqlDeveloper]

For reference, here is the rendered PostgreSQL:

-- 200                'My analysis'
--insert into results_schema.heracles_results (cohort_definition_id, analysis_id, stratum_1, count_value)
CREATE TEMP TABLE results_200
AS
SELECT
c1.cohort_definition_id, 200 as analysis_id,
  --
  --
                                vo1.visit_CONCEPT_ID as stratum_1,
  --
                                cast( '' as varchar(1) ) as stratum_2, cast( '' as varchar(1) ) as stratum_3, cast( '' as varchar(1) ) as stratum_4,
                                COUNT(distinct vo1.person_id) as count_value
FROM
cdm.visit_occurrence vo1
inner join HERACLES_cohort c1
on vo1.person_id = c1.subject_id
--
WHERE vo1.visit_start_date>=c1.cohort_start_date and vo1.visit_end_date<=c1.cohort_end_date
--
group by c1.cohort_definition_id,
--
--
vo1.visit_CONCEPT_ID
--
;
ANALYZE results_200
;
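The key difference is the placement of the statement terminator. A minimal side-by-side of the two endings from the renders above (assuming statements are split on semicolons before being sent to the database):

-- Spark translation: the newline before the terminator is lost, the
-- semicolon is swallowed by the line comment, and no split occurs:
vo1.visit_CONCEPT_ID
--;

-- PostgreSQL translation: the terminator stays on its own line, so the
-- batch splits into separate statements as intended:
vo1.visit_CONCEPT_ID
--
;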

Tagging @schuemie for input - this now seems like it may be an issue with SqlRender whereby a newline is required in one of the replacement patterns? Alternatively, we could review the WebAPI query and perhaps some re-formatting in that file would fix things?

TomWhite-MedStar commented 1 year ago

@chrisknoll, I submitted a pull request before noticing that you may have already fixed query 200_201.sql.

Silly question - where do I find the control file that determines the order in which those heraclesanalyses queries are run? I monitored the generated queries, and they are not in an obvious order. If you can let me know which sequence of queries is called for Full Analysis after 200, I can test the remaining ones for potential issues on Databricks / Spark.

chrisknoll commented 1 year ago

Sorry, I saw this after I saw the PR, so the PR is approved and merged (the changes looked good to me).

I need to look at the code again and remember how the Heracles reports are generated... I'll reply here when I work it out.