Circular references and large schemas are problematic

rubberduck203 commented 3 years ago

The FHIR JSON schema contains many circular references, notably Element and Extension, which are fundamental building blocks for all FHIR resources but there are others, such as Identifier and Reference.

If we transpile the FHIR JSON schema to AVRO as one giant *.avsc file, the circular references are correctly inlined, but the AVRO tooling chokes in several ways on the 110k line schema.

The gradle avro plugin straight up hangs and never completes.
avro-tools compile schema generates Java source code that does not compile.

If we transpile the FHIR JSON schema into many AVRO *.avsc files, we run into problems with the circular references. AVRO allows for circular references, but they must be inlined.

The gradle avro plugin can not generate the multiple files because it cannot properly resolve the record dependencies.
The maven avro plugin does better, but we must manually specify the dependency ordering.
In order to get the maven avro plugin to work, we must manually edit the generated schemas in order to inline circular dependencies.

The options as I see it now are:

Try to make avro-to-json-schema detect circular dependencies and inline just those instead of everything.
Use avro-to-json-schema to generate a "first pass" avro schema, then take ownership of the files from there. Manually editing and curating them.

I'm unsure of how possible option 1 is. Maybe there are some papers we could find on this kind of circular reference detection that would shed some light on how hard it would be to accomplish. If we could make that work, the dream of simply generating the avro schemas from the official HL7 schema might become a reality.

Option 2 is less desirable. When doing code generation, its usually best if you don't edit the generated files, so you can take advantage of changes in the source input and improvements in the generator tool itself. With that said, it's a better options than trying to hand write avro schemas for all 681 FHIR resources. At least we could start with a reasonably solid base very quickly.

It should be noted that we have discovered some places where the FHIR JSON schema does not match the FHIR spec. For example Questionnaire has a field called enableWhen.answer[x] which is a choice type. However, the JSON schema does not use a oneOf to define a union (schema snippet below). Taking ownership of the schemas would allow us to correct any such discrepancies until the official JSON schema can be corrected.

"Questionnaire_EnableWhen": {
     "comment": "Truncated for relevancy",
      "properties": {
        "answerBoolean": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^true|false$",
          "type": "boolean"
        },
        "answerDecimal": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^-?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-]?[0-9]+)?$",
          "type": "number"
        },
        "answerInteger":
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^-?([0]|([1-9][0-9]*))$",
          "type": "number"
        },
        "answerDate": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^([0-9]([0-9]([0-9][1-9]|[1-9]0)|[1-9]00)|[1-9]000)(-(0[1-9]|1[0-2])(-(0[1-9]|[1-2][0-9]|3[0-1]))?)?$",
          "type": "string"
        },
        "answerDateTime": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^([0-9]([0-9]([0-9][1-9]|[1-9]0)|[1-9]00)|[1-9]000)(-(0[1-9]|1[0-2])(-(0[1-9]|[1-2][0-9]|3[0-1])(T([01][0-9]|2[0-3]):[0-5][0-9]:([0-5][0-9]|60)(\\.[0-9]+)?(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00)))?)?)?$",
          "type": "string"
        },
        "answerTime": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^([01][0-9]|2[0-3]):[0-5][0-9]:([0-5][0-9]|60)(\\.[0-9]+)?$",
          "type": "string"
        },
        "answerString": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "pattern": "^[ \\r\\n\\t\\S]+$",
          "type": "string"
        },
        "answerCoding": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "$ref": "#/definitions/Coding"
        },
        "answerQuantity": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "$ref": "#/definitions/Quantity"
        },
        "answerReference": {
          "description": "A value that the referenced question is tested using the specified operator in order for the item to be enabled.",
          "$ref": "#/definitions/Reference"
        }
      },
      "additionalProperties": false
    }

https://www.hl7.org/fhir/questionnaire.schema.json.html

rubberduck203 commented 3 years ago

This boils down to cycle detection in a directed graph. If a node points to itself, it is not a problem and we can leave it alone.

It's when a node depends on it's parent that things become problematic.

Or on some other ancestor.

If we can detect the cycle, and inline the definition, we can "collapse" cyclic nodes into a self-referencing node.

The goal being to create a not quite directed acyclic graph where each node only depends on its children or itself. (It's the "or itself" part that makes this not a true DAG.)

rubberduck203 commented 3 years ago

Relevant Research:

It should be noted that Johnson's algorithm is probably the most widely known and understood algorithm for cycle detection.

carreraGroup / avro-on-fhir

Circular references and large schemas are problematic #2