OHDSI / Achilles

Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES) - descriptive statistics about a OMOP CDM database
https://ohdsi.github.io/Achilles/
128 stars 116 forks source link

Achilles Generating Incorrect Column Name in SQL Scripts - vo.discharge_to_concept_id #757

Open Zachary-Higgins opened 3 months ago

Zachary-Higgins commented 3 months ago

Hi - We're setting up Achilles to run on top of our delta lake through Databricks. I was having issues running the analysis through R, so I generated the SQL scripts and copied them over to Databricks. The generated script failed to complete when referencing the discharge_to_concept_id field because it was missing. In actuality, it looks like it changed from discharge to discharged.

https://ohdsi.github.io/CommonDataModel/cdm54.html#visit_occurrence

We're using OMOP CDM 5.4.1. It looks like this issue was reported last year in May, but has been closed and is still present. https://github.com/OHDSI/Achilles/issues/702

Here's the r script we used to generate the queries. I think it's pretty standard....

# Install Packages
if (!require("remotes")) install.packages("remotes")
remotes::install_github("OHDSI/Achilles", force=TRUE)
install.packages("DatabaseConnector")
library("DatabaseConnector")

# Download Drivers
Sys.setenv(DATABASECONNECTOR_JAR_FOLDER = "c:\\temp")
downloadJdbcDrivers("spark")

# Set Connection
connectionDetails <- createConnectionDetails(
  dbms="spark", 
  connectionString="",
  user="token", 
  password="")

options(connectionObserver = NULL)

library(Achilles)
Achilles::achilles(
  cdmVersion = "5.4.1", 
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "REMOVED",
  resultsDatabaseSchema = "REMOVED",
  dropScratchTables = TRUE,
  createIndices = FALSE,
  createTable = TRUE,
  numThreads = 10,
  sqlOnly = TRUE
)

And an example of the incorrect column reference in the generated SQL.

SELECT
1203 AS analysis_id,
 CAST(vo.discharge_to_concept_id AS STRING) AS stratum_1,
 CAST(NULL AS STRING) AS stratum_2,
 CAST(NULL AS STRING) AS stratum_3,
 CAST(NULL AS STRING) AS stratum_4,
 CAST(NULL AS STRING) AS stratum_5,
 COUNT(*) AS count_value
FROM
hive_metastore.omop.visit_occurrence vo
JOIN 
 hive_metastore.omop.observation_period op 
ON 
 vo.person_id = op.person_id
AND 
 vo.visit_start_date >= op.observation_period_start_date
AND 
 vo.visit_start_date <= op.observation_period_end_date
WHERE 
 vo.discharge_to_concept_id != 0
GROUP BY 
 vo.discharge_to_concept_id;
OPTIMIZE hive_metastore.omop.bajhxbnws_tmpach_1203
 ZORDER BY stratum_1;
Zachary-Higgins commented 3 months ago

It also generated the following incorrect column reference:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name admitting_source_concept_id cannot be resolved. Did you mean one of the following? [visit_source_concept_id, admitted_from_concept_id, visit_type_concept_id, discharged_to_concept_id, visit_concept_id].; line 65 pos 197

Assuming admitting_source_concept_id should actually be admitted_from_concept_id.

Should we create a new issue for each or keep a list here?