OHDSI / Achilles

Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES) - descriptive statistics about a OMOP CDM database
https://ohdsi.github.io/Achilles/
130 stars 121 forks source link

Achilles Execution Fails On Error #435

Closed ayo-uda closed 3 years ago

ayo-uda commented 5 years ago

Expected behavior

The generation of summary and results table data for a OMOP v5.3 CDM instance sourced from BigQuery after executing achilles.

Actual behavior

An 'Error getting job status' that pops up while attempting to merge created achilles result tables several minutes into the execution:

image

image

The interrupted execution leaves multiple temp tables remaining in the bq dataset designated as the scratch database schema.

Steps to reproduce behavior

  1. Creating a connection detail
  2. Executing achilles
install.packages('rJava', repos='https://cloud.r-project.org/')
install.packages('devtools' , repos='https://cloud.r-project.org/')
library(devtools)
install_github("ohdsi/SqlRender")
install_github("ohdsi/DatabaseConnector")
library(DatabaseConnector)
install_github("ohdsi/Achilles")
library(Achilles)

dbms <- "bigquery"
connectionString <- "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId={ProjectId};OAuthType=0;OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};Timeout=10000000"
connectionDetails <- createConnectionDetails(
  dbms=dbms,
  connectionString=connectionString,
  user="",
  password="",
  pathToDriver={pathToDriver})

achilles(
  connectionDetails,
  cdmDatabaseSchema="exampleDataset",
  resultsDatabaseSchema="resultsDataset",
  scratchDatabaseSchema="resultsDataset",
  vocabDatabaseSchema="exampleDataset",
  numThreads=10,
  sourceName="BigQuery",
  cdmVersion="5.3",
  runHeel=TRUE,
  smallCellCount=NULL,
  runCostAnalysis=FALSE,
  createIndices=TRUE,
  createTable=TRUE,
  verboseMode=TRUE)
ayo-uda commented 5 years ago

I believe I have pinpointed the problem. The query made by the calling of merge_achilles_tables.sql (snipped provided below) within the achilles.R script gives rise to the introduction of an **invalid character(0) column** throughout the rendered SQL query that is causing the error. I have provided examples of the rendered query and the resulting error message below.

Rendered Query Snippet

SQL:
--HINT DISTRIBUTE_ON_KEY(analysis_id)

CREATE TABLE resultsDataset.achilles_results

 AS
SELECT
analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value

FROM
(
  select cast(analysis_id  as int64) as analysis_id, cast(stratum_1 as string) as stratum_1, cast(stratum_2 as string) as stratum_2, cast(stratum_3 as string) as stratum_3, cast(stratum_4 as string) as stratum_4, cast(stratum_5 as string) as stratum_5, cast(count_value  as int64) as count_value from 
                                     resultsDataset.tmpach_0 union all select 2000000 as analysis_id, '13.638783 secs' as stratum_1, null as stratum_2, null as stratum_3, null as stratum_4, null as stratum_5, character(0) 
union all
 select cast(analysis_id  as int64) as analysis_id, cast(stratum_1 as string) as stratum_1, cast(stratum_2 as string) as stratum_2, cast(stratum_3 as string) as stratum_3, cast(stratum_4 as string) as stratum_4, cast(stratum_5 as string) as stratum_5, cast(count_value  as int64) as count_value from 
                                     resultsDataset.tmpach_1 union all select 2000001 as analysis_id, '11.222780 secs' as stratum_1, null as stratum_2, null as stratum_3, null as stratum_4, null as stratum_5, character(0) ...

Error Message

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "location" : "q",
    "locationType" : "parameter",
    "message" : "Function not found: character at [13:225]",
    "reason" : "invalidQuery"
  } ],
  "message" : "Function not found: character at [13:225]",
  "status" : "INVALID_ARGUMENT"
}

achilles.R Code Snippet Associated With Error

  if (!sqlOnly & includeRawCost) {
    # obtain the runTime for this analysis

    benchmarkSqls <- lapply(c(15000, 16000), function(rawCostId) {
      runTime <- .getAchillesResultBenchmark(rawCostId, outputFolder)  

      benchmarkSelects <- lapply(resultsTable$schema$FIELD_NAME, function(c) {
        if (tolower(c) == "analysis_id") {
          sprintf("%d as analysis_id", .getBenchmarkOffset() + rawCostId)
        } else if (tolower(c) == "stratum_1") {
          sprintf("'%s' as stratum_1", runTime)
        } else if (tolower(c) == "count_value") {
          sprintf("%d as count_value", smallCellCount + 1) 
        } else {
          sprintf("NULL as %s", c)
        }
      })
      SqlRender::render(sql = "select @benchmarkSelect", benchmarkSelect = paste(benchmarkSelects, collapse = ", "))
    })
    benchmarkSql <- paste(benchmarkSqls, collapse = " union all ")
    benchmarkSql <- SqlRender::translate(sql = benchmarkSql, targetDialect = connectionDetails$dbms)
    detailSqls <- c(detailSqls, benchmarkSql)
  }

  SqlRender::loadRenderTranslateSql(sqlFilename = "analyses/merge_achilles_tables.sql",
                                    packageName = "Achilles",
                                    dbms = connectionDetails$dbms,
                                    warnOnMissingParameters = FALSE,
                                    createTable = createTable,
                                    resultsDatabaseSchema = resultsDatabaseSchema,
                                    detailType = resultsTable$detailType,
                                    detailSqls = paste(detailSqls, collapse = " \nunion all\n "),
                                    fieldNames = paste(resultsTable$schema$FIELD_NAME, collapse = ", "),
                                    smallCellCount = smallCellCount)
}
t-abdul-basser commented 5 years ago

We are experiencing the same issue against on SQL Server. #433

djb2188 commented 4 years ago

@t-abdul-basser @alondhe @cukarthik I am finding the following issues which would certainly break the Achilles/SQLRender process down.

select analysis_id, 
cast(null as varchar(255)) as stratum_1, cast(null as varchar(255)) as stratum_2, cast(null as varchar(255)) as stratum_3, 
**cast(null as varchar(255)) as stratum_18**, cast(null as varchar(255)) as stratum_5,
count_value, min_value, max_value, avg_value, stdev_value, median_value, p10_value, p25_value, p75_value, p90_value
into cdm_mdcr.results.tmpach_dist_1803
from `#tempResults_1803`
t-abdul-basser commented 4 years ago

Thanks @djb2188. Interesting. Please submit a PR when ready so that we can review these changes.

Andreyyiv commented 4 years ago

@alondhe @t-abdul-basser The character(0) string is produced by this code sprintf("%d as count_value", smallCellCount + 1) if the value of smallCellCount is set to NULL per the package documentation to disable masking the actual counts when it is not needed. A potential work-around might be to set smallCellCount = 0 instead of NULL.

fdefalco commented 3 years ago

This last comment is accurate. I will update the documentation to note that smallCellCount should be set to 0 instead of NULL to close this issue.