OHDSI / Strategus

[Under development] An R packages for coordinating and executing analytics using HADES modules
https://ohdsi.github.io/Strategus/
6 stars 11 forks source link

JSON Serialization problems using ParallelLogger #84

Open chrisknoll opened 1 year ago

chrisknoll commented 1 year ago

I noticed a mismatch in the ParallelLogger::saveSettingsToJson() vs. the JSON that was produced via the irDesign$asJSON() from the CohortIncidence package.

Steps to reproduce: Create an IR design with 1 analysis containing >1 targets, but exactly one TAR and Outcome. Test code:

targets <- lapply(c(1:10), function(targetCohortId) {
  return (CohortIncidence::createCohortRef(id=targetCohortId, name=paste0("Cohort ", targetCohortId)))
})

outcomes <- lapply(c(1), function(outcomeId) {
  return (CohortIncidence::createOutcomeDef(id=outcomeId, name=paste0("Outcome ", outcomeId), cohortId=outcomeId, cleanWindow=335))
})

tars <- list(CohortIncidence::createTimeAtRiskDef(id=1, startWith="start", startOffset = -7, endWith="end", endOffset = 14))

analysis1 <- CohortIncidence::createIncidenceAnalysis(targets = sapply(targets, function(t) { return(t$id);}),
                                                      outcomes = sapply(outcomes, function(o) { return(o$id);}),
                                                      tars = sapply(tars, function(t) { return(t$id);}));

irDesign <- CohortIncidence::createIncidenceDesign(targetDefs = targets,
                                                   outcomeDefs = outcomes,
                                                   tars=tars,
                                                   analysisList = list(analysis1),
                                                   strataSettings = CohortIncidence::createStrataSettings(byGender=T, byAge=T, ageBreaks = c(18,50,65)))
irDesign$asJSON(pretty=T)

ParallelLogger::saveSettingsToJson(irDesign$toList(), file.path("testSerialize.json"))

The call to irDesign$asJSON(pretty=T) yields this JSON fragment:

  "analysisList": [
    {
      "targets": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
      "outcomes": [1],
      "tars": [1]
    }
  ],

The testSerialize.json that ParallelLogger created contains:

  "analysisList": [
    {
      "targets": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
      "outcomes": 1,
      "tars": 1
    }
  ],

Note: outcomes and tars are supposed to be collections (arrays), not single elements. The reason why this works for Strategus is that the CohortIncidenceModule will read in this JSON using the R6 class and properly check for types, and store in a list (and will serialize it out to the underlying CohortIncidence module in the correct form). The reason why this might work more generally is that R doesn't seem to know the difference between a single valued variable vs. a vector of size 1 (it treats it as the same thing). However, while that might be fine in R, if you pass this JSON to a parser that expects outcomes and tars to be arrays, it will fail.

One solution is to employ some R function (or R6 classes) where we can control the JSON serialization. or pressure R package maintainers for json to 'do the right thing'.

Additional Reading: The plea to jsonlite devs asking them to address this behavior. Discussion about using R6 classes with background on CohortIncidence choices.

anthonysena commented 1 year ago

Thanks @chrisknoll for writing this up. Just to focus on this part for the issue:

However, while that might be fine in R, if you pass this JSON to a parser that expects outcomes and tars to be arrays, it will fail.

So the difference in serialization as demonstrated between CohortIncidence and ParallelLogger could potentially cause issues for the modules if they do not handle collections vs. single elements correctly and pass to a JSON parser other than jsonlite? Do I have this right? My expectation is that people using Strategus should use ParallelLoggger for their JSON serialization/deserialization operations. The idiosyncrasies described here are then a responsibility of the module to handle as you've done with CohortIncidence.

chrisknoll commented 1 year ago

Hi, @anthonysena:

The demonstration I offered between CI and PL was just to demonstrate the difference in serialization. As far as where it would cause issues: if you are parsing the JSON in an R context (via jsonlite, or some other R package), you won't run into issues because R doesn't see a difference between a singular value of 1 and an array containing 1.

> x<- c(1)
> x
[1] 1
> x[1]
[1] 1
> x[1][1]
[1] 1

So, if you get some JSON and parse it in an R context, and then read a value from the elements, whether it was sourced from a scalar variable or array variable, looks all the same to R!

Where this gets bad is if you take R-parsed JSON and then serialize it out over to some other context like a parameter to CIRCE or some other Python call that actually cares about the structure of the JSON (I believe Python knows the difference between a scalar variable and an array variable). In this context, I believe you get into trouble.

This lends to a bit of the justification on serializing a JSON string as a string instead of the object-nodes (discussed here): If you just encode it as a string, then the parser won't screw it up as it parses the child elements and ignores the array-or-not context of the fields.