SPIKE: Investigate PHDC Validation

DanPaseltiner commented 11 months ago

We need to understand how we should go about ensuring that the PHDC messages we are produce are valid. However, we do not want to turn this into a deep rabbit hole to get lost in. Therefore, we will timebox this work through EOD on 01/08/2024.

Acceptance Criteria: Artifacts are included or linked to this ticket describing the options for PHDC validation and some initial ideas on how we might implement them.

robertmitchellv commented 10 months ago

Based on the NBS PHDC Interface Spec v1.4, we see that all PHC docs begin with the <ClinicalDocument> tag:

<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA_SDTC.xsd"
xmlns="urn:hl7-org:v3" xmlns:sdtc="urn:hl7-org:sdtc">

Based on the above, we can see a couple of things happening:

xsi:schemaLocation="urn:hl7-org:v3 CDA_SDTC.xsd": tells us that we're using HL7 v3 CDA with the SDTC extensions. We'll need the CDA_SDTC.xsd file for validation of outputs to ensure they're well formed and valid.
We'll likely need more details about the SDTC extensions and the relationships between them and the HL7v3 CDA standards

Luckily the files we need can be found in this repo here:

https://github.com/HL7/CDA-core-2.0/tree/master/schema/extensions/SDTC/infrastructure/cda

CDA_SDTC.xsd
POCD_MT000040_SDTC.xsd
SDTC.xsd

If we look at CDA_SDTC.xsd it is referencing the other doc here:

<xs:include schemaLocation="POCD_MT000040_SDTC.xsd"/>

Looking at POCD_MT000040_SDTC.xsd we see it is fairly robust and ~1594 lines in length. Looking at this file we can see that it is also importing the SDTC.xsd file here:

<xs:import namespace="urn:hl7-org:sdtc" schemaLocation="SDTC.xsd" />

So having all three will be useful for validation of our outputs.

In connection to how we move forward we can do one of two things:

We can validate after each step in the build process; implementing will be more complex but it'll catch errors earlier.
We can validate after construction using the three files above and let the builder focus on what it needs to do.

Personally, I'm in favor of option 2 since there are still some outstanding questions I have about import into NBS, what data within a PHDC is parsed and available in tables for analysis vs viewable within the UI and to what extent the solve we engineer requires either of these or both to be present. Keeping it simple for the moment will allow us to focus more on the details of the spec that are unambiguous.

robertmitchellv commented 10 months ago

I wrote a very quick validation script in python using lxml and it turns out that we need the entire CDA-core-2.0 repo's schema directory in order to validate PHDC XML. Moreover, we need to maintain the specific directory structure as well; since there are references to other .xsd throughout starting with CDA_SDTC.xsd on down and it'll throw an error if it cannot find the files where it expects to find them. However, I was able to validate our two sample PHDC files once I adjusted the path and copied the entire schema directory into my script. So using these two files in a data/ directory:

SamplePHDC_1.xml
SamplePHDC_2.xml

I was able to get the following response:

python validate_phdc.py
SamplePHDC_2.xml is valid.
SamplePHDC_1.xml is valid.

Using the following script:

from lxml import etree
from pathlib import Path

def load_schema(schema_file):
    with open(schema_file, "r") as file:
        schema_doc = etree.parse(file)
        return etree.XMLSchema(schema_doc)

def validate_xml(xml_file, schemas):
    with open(xml_file, "r") as file:
        xml_doc = etree.parse(file)
        for schema_name, schema in schemas.items():
            if not schema.validate(xml_doc):
                print(f"Validation failed for {xml_file.name} against {schema_name}")
                return False
    return True

# create Path objects for directories
schema_dir = Path("schema/extensionsSDTC/infrastructure/cda/")
data_dir = Path("data/")

# load all schemas
schemas = {
    schema_file.name: load_schema(schema_file)
    for schema_file in schema_dir.glob("*.xsd")
}

# iterate over XML files in 'data' directory
for xml_file in data_dir.glob("*.xml"):
    if validate_xml(xml_file, schemas):
        print(f"{xml_file.name} is valid.")

This was just to get a feel for how we would validate using the second approach from above.

DanPaseltiner commented 10 months ago

Thanks for putting all of this together. I'm glad to see we have a POC for running validation on the example PHDCs. I think the appropriate place to run validation is during CI. We definitely want validation code separate from the PHDC builder and we also shouldn't run validation regularly on every PHDC we produce:

This will be a huge performance hit
We need to do the work up front to ensure that any PHDC a client receives from us is valid

So by putting validation into a unit test and an integration test we will ensure that our outputs are valid PHDC and prevent ourselves from introduces changes that would result in invalid PHDC.

@robertmitchellv could you add 2 tickets to the backlog?

Include PHDC validation in at least one unit test of the PHDC builder
Write an integration test for the fhir-to-phdc endpoint that runs the returned PHDC through validation.

I don't think we are ready to pick up these tickets, but should definitely have them in the backlog.

CDCgov / phdi

SPIKE: Investigate PHDC Validation #1019