Closed DanPaseltiner closed 10 months ago
Based on the NBS PHDC Interface Spec v1.4, we see that all PHC docs begin with the <ClinicalDocument>
tag:
<ClinicalDocument xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA_SDTC.xsd"
xmlns="urn:hl7-org:v3" xmlns:sdtc="urn:hl7-org:sdtc">
Based on the above, we can see a couple of things happening:
xsi:schemaLocation="urn:hl7-org:v3 CDA_SDTC.xsd":
tells us that we're using HL7 v3 CDA with the SDTC extensions. We'll need the CDA_SDTC.xsd
file for validation of outputs to ensure they're well formed and valid.Luckily the files we need can be found in this repo here:
https://github.com/HL7/CDA-core-2.0/tree/master/schema/extensions/SDTC/infrastructure/cda
CDA_SDTC.xsd
POCD_MT000040_SDTC.xsd
SDTC.xsd
If we look at CDA_SDTC.xsd
it is referencing the other doc here:
<xs:include schemaLocation="POCD_MT000040_SDTC.xsd"/>
Looking at POCD_MT000040_SDTC.xsd
we see it is fairly robust and ~1594 lines in length. Looking at this file we can see that it is also importing the SDTC.xsd
file here:
<xs:import namespace="urn:hl7-org:sdtc" schemaLocation="SDTC.xsd" />
So having all three will be useful for validation of our outputs.
In connection to how we move forward we can do one of two things:
Personally, I'm in favor of option 2 since there are still some outstanding questions I have about import into NBS, what data within a PHDC is parsed and available in tables for analysis vs viewable within the UI and to what extent the solve we engineer requires either of these or both to be present. Keeping it simple for the moment will allow us to focus more on the details of the spec that are unambiguous.
I wrote a very quick validation script in python using lxml
and it turns out that we need the entire CDA-core-2.0
repo's schema
directory in order to validate PHDC XML. Moreover, we need to maintain the specific directory structure as well; since there are references to other .xsd
throughout starting with CDA_SDTC.xsd
on down and it'll throw an error if it cannot find the files where it expects to find them. However, I was able to validate our two sample PHDC files once I adjusted the path and copied the entire schema
directory into my script. So using these two files in a data/
directory:
SamplePHDC_1.xml
SamplePHDC_2.xml
I was able to get the following response:
python validate_phdc.py
SamplePHDC_2.xml is valid.
SamplePHDC_1.xml is valid.
Using the following script:
from lxml import etree
from pathlib import Path
def load_schema(schema_file):
with open(schema_file, "r") as file:
schema_doc = etree.parse(file)
return etree.XMLSchema(schema_doc)
def validate_xml(xml_file, schemas):
with open(xml_file, "r") as file:
xml_doc = etree.parse(file)
for schema_name, schema in schemas.items():
if not schema.validate(xml_doc):
print(f"Validation failed for {xml_file.name} against {schema_name}")
return False
return True
# create Path objects for directories
schema_dir = Path("schema/extensionsSDTC/infrastructure/cda/")
data_dir = Path("data/")
# load all schemas
schemas = {
schema_file.name: load_schema(schema_file)
for schema_file in schema_dir.glob("*.xsd")
}
# iterate over XML files in 'data' directory
for xml_file in data_dir.glob("*.xml"):
if validate_xml(xml_file, schemas):
print(f"{xml_file.name} is valid.")
This was just to get a feel for how we would validate using the second approach from above.
Thanks for putting all of this together. I'm glad to see we have a POC for running validation on the example PHDCs. I think the appropriate place to run validation is during CI. We definitely want validation code separate from the PHDC builder and we also shouldn't run validation regularly on every PHDC we produce:
So by putting validation into a unit test and an integration test we will ensure that our outputs are valid PHDC and prevent ourselves from introduces changes that would result in invalid PHDC.
@robertmitchellv could you add 2 tickets to the backlog?
fhir-to-phdc
endpoint that runs the returned PHDC through validation.I don't think we are ready to pick up these tickets, but should definitely have them in the backlog.
We need to understand how we should go about ensuring that the PHDC messages we are produce are valid. However, we do not want to turn this into a deep rabbit hole to get lost in. Therefore, we will timebox this work through EOD on 01/08/2024.
Acceptance Criteria: Artifacts are included or linked to this ticket describing the options for PHDC validation and some initial ideas on how we might implement them.