KohlbacherLab / dnpm-dip-api-gateway

REST API Gateway component for DNPM:DIP
MIT License

Validation Error: using API endpoint patient-record:validate #8

Open kilpert opened 4 hours ago

kilpert commented 4 hours ago

I tried to validate a freshly downloaded patient-record against the API endpoint, but the response does not make sense to me, which makes me believe that the downloaded schema is not correct.

Here is what I do (with a Python script). These are the main steps (a minimal sketch follows after the list):

  1. Download the schema GET https://dnpm.bwhealthcloud.de/api/rd/etl/patient-record/schema => 200 OK
  2. Download the patient-record GET https://dnpm.bwhealthcloud.de/api/rd/fake/data/patient-record?format=application/json%2Bv2 => 200 OK
  3. Validate the just-downloaded patient-record (headers={"Accept": "*/*", "Content-type": "application/json+v2"}; the JSON payload is the patient-record from step 2): POST https://dnpm.bwhealthcloud.de/api/rd/etl/patient-record:validate => 200 OK

    ... the response is weird:
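
For reference, roughly what my script does for steps 1-3 (a minimal sketch using the requests library; URLs and headers as above):

import requests

BASE = "https://dnpm.bwhealthcloud.de/api/rd"

# 1. Download the JSON schema for patient records
schema = requests.get(f"{BASE}/etl/patient-record/schema").json()

# 2. Download a randomly generated example patient record
record = requests.get(
    f"{BASE}/fake/data/patient-record",
    params={"format": "application/json+v2"},
).json()

# 3. POST the record back to the validate endpoint
response = requests.post(
    f"{BASE}/etl/patient-record:validate",
    json=record,
    headers={"Accept": "*/*", "Content-type": "application/json+v2"},
)
print(response.status_code)
print(response.json())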

Problem A:

In step 3, along with the 200 OK status, I also get this response JSON:

{
    "patient": "64abb4fd-1de4-4167-a762-d339b30b8631",
    "issues": [
        {
            "severity": "warning",
            "message": "Fehlende Angabe 'Krankenkasse'",
            "path": "/Patient[64abb4fd-1de4-4167-a762-d339b30b8631]/Krankenkasse"
        },
        {
            "severity": "info",
            "message": "Fehlende optionale Angabe 'Todesdatum', ggf. nachpr\u00fcfen, ob nachzureichen",
            "path": "/Patient[64abb4fd-1de4-4167-a762-d339b30b8631]/Todesdatum"
        }
    ],
    "createdAt": "2024-10-17T12:58:57.532494Z"
}

This makes no sense, because "Krankenkasse" is not defined in the schema! At least not in the schema that one can download in step 1.

Is the downloadable schema the same one that is used for validation? If not, how can we download the schema that was actually used?

Problem B:

The API endpoint in step 3 does not validate properly; no errors are reported at all. However, when validating locally in Python using jsonschema.validate(patient_record, schema), there are two issues, on age and vitalStatus.
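
A minimal sketch of the local check (the file names are illustrative; the files contain the schema and the record downloaded in steps 1 and 2):

import json
import jsonschema

with open("patient_record_schema.json") as f:
    d_schema = json.load(f)
with open("fake_patient_record.json") as f:
    d_patient_record = json.load(f)

# raises jsonschema.exceptions.ValidationError if the record violates the schema
print(jsonschema.validate(d_patient_record, d_schema))

This raises: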

Traceback (most recent call last):
  File "/vol/huge/Modellvorhaben/dnpm_dip_se/dnpm_dip_seq_api.py", line 206, in <module>
    print(jsonschema.validate(d_patient_record, d_schema))
  File "/home/kilpert/miniforge3/lib/python3.10/site-packages/jsonschema/validators.py", line 1332, in validate
    raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('age', 'vitalStatus' were unexpected)

Failed validating 'additionalProperties' in schema['properties']['patient']:
    {'$anchor': 'Patient',
     'additionalProperties': False,
     'properties': {'address': {'additionalProperties': False,
                                'properties': {'municipalityCode': {'type': 'string'}},
                                'required': ['municipalityCode'],
                                'type': 'object'},
                    'birthDate': {'format': 'date', 'type': 'string'},
                    'dateOfDeath': {'format': 'date', 'type': 'string'},
                    'gender': {'$ref': '#Coding_Gender'},
                    'healthInsurance': {'$ref': '#Reference'},
                    'id': {'$ref': '#Id'},
                    'managingSite': {'$ref': '#Coding'}},
     'required': ['id', 'gender', 'birthDate'],
     'type': 'object'}

On instance['patient']:
    {'age': {'unit': 'Years', 'value': 37},
     'birthDate': '1987-03-30',
     'gender': {'code': 'female',
                'display': 'Weiblich',
                'system': 'Gender'},
     'id': '8fb09ba6-dc92-4905-80c5-81b5aef98e0a',
     'vitalStatus': {'code': 'alive',
                     'display': 'Lebend',
                     'system': 'dnpm-dip/patient/vital-status'}}

However, none of these issues are raised by the validate API endpoint! Again, this is an indication that the wrong schema was used for validation.

lucienclin commented 4 hours ago

Again, thanks for checking this so thoroughly.

This requires a longer explanation, though:

The "validation" logic exposed by this validation endpoint is two-fold: After syntactic validation (i.e. if the payload can be properly deserialized as a PatientRecord upload) there is a semantic validation step.

This is required for various reasons:

Many of the attributes are made optional on the syntactic level, even though they are semantically required for a data set to be meaningful. The reason for this is separation of concerns: say you are an ETL developer in charge of extracting the data from the respective primary systems and sending it to the node backend. If this upload were to fail because semantically required attributes are missing due to incomplete documentation in the patient record, you would be receiving upload rejections about which you can't really do anything.

Further semantic validations include checks that references among objects in the record are resolvable (i.e. referential integrity) and that coded entries (e.g. ICD-10 codes) are correctly resolvable in the respective code systems. Neither of these checks would be possible on the schema level anyway (in case of interest, see for instance here).

Instead, such errors pertaining to the content of a patient record are raised in the "data quality issue report" created in the semantic validation step and, aside from being returned in the upload response, are stored in the validation module of the DNPM node. The idea is that these data quality issue reports are made available to documentarists in charge of completing/correcting the patient records accordingly. In contrast to syntactic validation errors, which would ultimately be dependent on the validation/deserialization library used and whose content is developer- but not documentarist-friendly, these issue reports are specifically created with German error messages so as to be understandable by such documentarists. This is why, in your example above, the errors say "Fehlende Angabe 'Krankenkasse'" and "...Todesdatum", corresponding to the missing attributes Patient.healthInsurance and Patient.dateOfDeath, which are defined in the schema.
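
To make this concrete, here is a hypothetical sketch of how an ETL client could evaluate such an issue report returned by the validate endpoint (the severity, message and path fields are the ones shown in your example; treating only severities other than "info" and "warning" as blocking is an illustration, not a documented contract):

import requests

def record_is_acceptable(record: dict) -> bool:
    # POST the record to the validate endpoint and inspect the data quality issue report
    response = requests.post(
        "https://dnpm.bwhealthcloud.de/api/rd/etl/patient-record:validate",
        json=record,
        headers={"Content-type": "application/json+v2"},
    )
    response.raise_for_status()
    report = response.json()

    for issue in report.get("issues", []):
        # log every issue so that documentarists can follow up on missing/incorrect data
        print(f'{issue["severity"]}: {issue["message"]} ({issue["path"]})')

    # assumption for illustration: "info" and "warning" issues do not block the upload
    return all(issue["severity"] in ("info", "warning") for issue in report.get("issues", []))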

Even though it's not incorporated yet, the portal will contain sub-portals for the respective validation module, so that documentarists can log in and see which patient records have data quality issues to be fixed.

Does this answer your question?

lucienclin commented 4 hours ago

P.S. Concerning Problem B:

I already explained the reason for this "inconsistency" with Patient.age and Patient.vitalStatus in another comment: https://github.com/KohlbacherLab/dnpm-dip-api-gateway/issues/4#issuecomment-2378930499

These dynamic attributes are not supposed to be part of the data upload; they are just added for internal purposes of our system whenever a Patient object is serialized to JSON, hence their presence in the random JSON examples. Just ignore them when setting up your ETL.
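
If you want the local jsonschema check from your script to pass on those random examples, a small sketch of stripping these attributes first (d_patient_record and d_schema as in your script; this is just a local workaround, not part of the API):

import jsonschema

def strip_dynamic_fields(record: dict) -> dict:
    # remove the dynamically added attributes that are not part of the upload schema
    patient = dict(record.get("patient", {}))
    patient.pop("age", None)
    patient.pop("vitalStatus", None)
    return {**record, "patient": patient}

jsonschema.validate(strip_dynamic_fields(d_patient_record), d_schema)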

kilpert commented 2 hours ago

Let me summarise to see if I understand you correctly:

The validation process consists of two separate steps:

  1. syntax validation (using the JSON schema), which checks whether a data set can be processed at all
  2. subsequent semantic validation (checking the internal logic and completeness of the data set)

The suggested way to upload a new patient record is to interactively test the JSON against the API validation endpoint, which returns a ‘translation’ of the original error messages into German.

From a user's perspective, I would really appreciate it if you would also include the original error messages. Returning the line number and especially the names of the actual variables would help a lot in understanding where the problem actually occurred.

At least I had hoped that the JSON schema would help in determining whether a data set is complete! If I may make a suggestion: please re-check and, if necessary, update the JSON schema to make at least the "required" fields reliable, because they are used to communicate which variables are mandatory on the Confluence page (https://ibmi-ut.atlassian.net/wiki/spaces/DRD/pages/1474938/Data+Model+-+SE+dip). I think that most users are preparing their data sets based on this information.

It would also help if you updated the documentation to explain this in a prominent section. It is not easy to understand right now, especially since one is apparently expected to get error messages by default.