OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/
11 stars 9 forks source link

Use of Arrow within CohortDiagnostics causes issue #43

Open azimov opened 1 year ago

azimov commented 1 year ago

Errors when using the develop branch of CohortDiagnostics occur. I'm not sure if this is Andromeda or the downstream use of FeatureExtraction.

Error (test-againstCdm.R:24:3): Cohort diagnostics in incremental mode
Error: Invalid: Failed to parse value: FEMALE
Backtrace:
  1. base::system.time(...)
       at test-againstCdm.R:24:2
  2. CohortDiagnostics::executeDiagnostics(...)
  5. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/RunDiagnostics.R:715:8
  8. CohortDiagnostics::runCohortTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/TimeSeries.R:575:12
 10. Andromeda (local) `$<-`(`*tmp*`, timeSeries, value = `<FlSystmD[,15]>`)
 12. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<FlSystmD[,15]>`)
 13. Andromeda (local) .local(x, i, ..., value)
 14. arrow::write_dataset(value, file.path(attr(x, "path"), i), format = "feather")
 15. plan$Write(...)
 16. arrow:::ExecPlan_Write(...)

Error (test-moduleTimeSeries.R:86:5): Testing cohort time series execution
Error: Invalid: Failed to parse value: MALE
Backtrace:
  1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
       at test-moduleTimeSeries.R:86:4
  4. CohortDiagnostics::runCohortTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/TimeSeries.R:575:12
  6. Andromeda (local) `$<-`(`*tmp*`, timeSeries, value = `<FlSystmD[,15]>`)
  8. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<FlSystmD[,15]>`)
  9. Andromeda (local) .local(x, i, ..., value)
 10. arrow::write_dataset(value, file.path(attr(x, "path"), i), format = "feather")
 11. plan$Write(...)
 12. arrow:::ExecPlan_Write(...)

Error (test-moduleTimeSeries.R:291:5): Testing Data source time series execution
Error: Invalid: Failed to parse value: MALE
Backtrace:
  1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
       at test-moduleTimeSeries.R:291:4
  4. CohortDiagnostics::runCohortTimeSeriesDiagnostics(...)
       at CohortDiagnostics/R/TimeSeries.R:639:8
  6. Andromeda (local) `$<-`(`*tmp*`, timeSeries, value = `<FlSystmD[,15]>`)
  8. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<FlSystmD[,15]>`)
  9. Andromeda (local) .local(x, i, ..., value)
 10. arrow::write_dataset(value, file.path(attr(x, "path"), i), format = "feather")
 11. plan$Write(...)
 12. arrow:::ExecPlan_Write(...)

Steps to reproduce:

Note that this occurs in the time series diagnostics as well as other places.

Running the time series diagnostics and breaking at line #441:

Lets you find an error of this type.

Here running resultsInAndromeda$allData %>% dplyr::collect() produces something similar.

Error in compute.Dataset(x) : Invalid: Failed to parse value: MALE

Note that gender in the SQL is taken directly from concept.concept_name a varchar/text string by definition.

Likely also reproducible when calling runCohortTimeSeriesDiagnostics directly.

azimov commented 1 year ago

Adding more to this issue:

This error also occurs.

Error (test-moduleCohortRelationship.R:363:5): Testing cohort relationship logic - incremental FALSE
Error: Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(timeId) of type double and FieldRef.Name(timeId) of type int32
Backtrace:
  1. CohortDiagnostics::runCohortRelationshipDiagnostics(...)
       at test-moduleCohortRelationship.R:363:4
  3. Andromeda (local) `$<-`(`*tmp*`, cohortRelationships, value = `<arrw_dp_[,38]>`)
  5. Andromeda (local) `[[<-`(`*tmp*`, name, value = `<arrw_dp_[,38]>`)
  6. Andromeda (local) .local(x, i, ..., value)
  7. arrow::write_dataset(value, tempDir, format = "feather")
  8. plan$Build(dataset)
  9. self$Build(.data$.data)
 10. node$Join(...)
 12. arrow:::ExecNode_Join(...)

This is likely with a joing using dplyr here.

esultsInAndromeda$cohortRelationships <-
      resultsInAndromeda$cohortRelationships %>%
      dplyr::inner_join(resultsInAndromeda$timePeriods, by = "timeId") %>%
      dplyr::select(-timeId) %>%
      dplyr::arrange(
        cohortId,
        comparatorCohortId,
        startDay,
        endDay
      )

The table cohortRelationships has a time id field that is literally,

SELECT ..., @time_id time_id ...

Looping over values from timePeriods which is assigned via the user of row_number in R.

casting time_id as int in the sql may fix this, however, I expect that being able to dynamically join to compatible types would be a desired feature as dbdplyr doesn't expect explicit type casting.

ablack3 commented 1 year ago

Thanks for reporting this!

ablack3 commented 1 year ago

@anthonysena Have you tested FeatureExtraction with the develop branch of Andromeda?

anthonysena commented 1 year ago

Not yet @ablack3 - trying to get the unit tests back online and then see how FE is working with the develop branch of Andromeda

azimov commented 1 year ago

@ablack3 so following up on the CohortDiagnostics Results i was able to resolve the initial issues by casting explicitly to character or integer in SQL when querying to Andromeda, forcing the arrow object to store empty values as the correct type. Though this works I'd still note that the joins between different data types is something that will happen and could shoot us in the foot - it could be that non of our unit tests produce this situation so a bug is still lying around that will go unnoticed (e.g. a left join to an empty tibble could be enough to break things). If there was a way to resolve this internally I think it would be worthwhile.

However, when running with the latest version I get a new strange error. The problem is that there are multiple rows keys in a results object that should be generated from a group by select. (This doesn't happen in the current released version of Andromeda) and I can't figure out why.

Warning (test-5-moduleTimeSeries.R:86:5): Testing cohort time series execution
Default behavior of `pull()` on Arrow data is changing. Current behavior of returning an R vector is deprecated, and in a future release, it will return an Arrow `ChunkedArray`. To control this:
i Specify `as_vector = TRUE` (the current default) or `FALSE` (what it will change to) in `pull()`
i Or, set `options(arrow.pull_as_vector)` globally
This warning is displayed once every 8 hours.
Backtrace:
 1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
      at test-5-moduleTimeSeries.R:86:4
 7. arrow:::pull.arrow_dplyr_query(., n)
 8. arrow:::handle_pull_as_vector(out, as_vector)

Error (test-5-moduleTimeSeries.R:86:5): Testing cohort time series execution
Error in `makeDataExportable(x = data, tableName = "time_series", minCellCount = minCellCount, 
    databaseId = databaseId)`:  - duplicates found in primary key for table time_series. The primary keys are: cohortId, databaseId, periodBegin, periodEnd, seriesType, calendarInterval, gender, ageGroup
Backtrace:
 1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
      at test-5-moduleTimeSeries.R:86:4
 2. CohortDiagnostics:::makeDataExportable(...)
      at CohortDiagnostics/R/TimeSeries.R:590:8

Error (test-5-moduleTimeSeries.R:291:5): Testing Data source time series execution
Error in `makeDataExportable(x = data, tableName = "time_series", minCellCount = minCellCount, 
    databaseId = databaseId)`:  - duplicates found in primary key for table time_series. The primary keys are: cohortId, databaseId, periodBegin, periodEnd, seriesType, calendarInterval, gender, ageGroup
Backtrace:
 1. CohortDiagnostics:::executeTimeSeriesDiagnostics(...)
      at test-5-moduleTimeSeries.R:291:4
 2. CohortDiagnostics:::makeDataExportable(...)
      at CohortDiagnostics/R/TimeSeries.R:652:4

The above is a check that is performed when exporting data from the package - this is to ensure that the results data can be inserted into the result schemas for viewing in the shiny app etc.

To reproduce:

  1. Clone CohortDiagnostics git
  2. checkout develop branch
  3. Working - run unit tests with Sqlite Andromeda (version 0.6.3)
  4. Not working (produces above errors) - install Andromeda develop branch, run unit tests

The offending code is around here. This wasn't written by me (and I find it kind of confusingly written so this might be a tough one to work out) - it may be just an error with the combinations of Andromeda objects. However, the SQL that generates the results looks fine - it's using a GROUP BY clause so the rows in question should be aggregated (and indeed, are in the Sqlite version).

Either CohortDiagnostics is doing something bad when combining the objects like this or something is going wrong when exporting the data. I can't quite figure out why that might be, and why it works fine when using the Sqlite version.