cldf / csvw

CSV on the web
Apache License 2.0
36 stars 6 forks source link

Foreign Key Setup in JSON Schema #74

Closed megin1989 closed 3 months ago

megin1989 commented 3 months ago

How do we set a foreign key here? When we run our JSON, we encounter the following errors. I've attached the JSON schema and py file code below.

/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py:426: UserWarning: Invalid property pattern for Column
  warnings.warn('Invalid property {} for {}'.format(k, type_name))
/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py:426: UserWarning: Invalid property enum for Column
  warnings.warn('Invalid property {} for {}'.format(k, type_name))
Traceback (most recent call last):
  File "/home/megin/workspaces/csv-sql-schema/csv-old/csvw/new.py", line 10, in <module>
    table_group = TableGroup.from_file(datapackage_path)
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 1047, in from_file
    res = cls.fromvalue(data or get_json(fname))
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 433, in fromvalue
    return cls(**cls.partition_properties(d))
  File "<attrs generated init csvw.metadata.TableGroup>", line 39, in __init__
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 1426, in converter_tables
    res.append(Table.fromvalue(vv) if isinstance(vv, dict) else vv)
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 433, in fromvalue
    return cls(**cls.partition_properties(d))
  File "<attrs generated init csvw.metadata.Table>", line 31, in __init__
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 1006, in <lambda>
    converter=lambda v: Schema.fromvalue(v))
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 930, in fromvalue
    return cls(**cls.partition_properties(v))
  File "<attrs generated init csvw.metadata.Schema>", line 30, in __init__
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 886, in <lambda>
    converter=lambda v: [] if v is None else converter_foreignKeys(v))
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 862, in converter_foreignKeys
    res.append(ForeignKey.fromdict(d))
  File "/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py", line 844, in fromdict
    _ = Reference(**d['reference'])
KeyError: 'reference'

This is my JSON schema.

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
      {
          "url": "data/QE_ADMIN_DATA_qcs-test-20240603-testcase4.csv",
          "tableSchema": {
            "columns": [
              {"name": "PAT_MRN_ID", "titles": "PAT_MRN_ID", "datatype": "string", "required": true},
              {"name": "FACILITY_ID", "titles": "FACILITY_ID", "datatype": "string", "required": true},
              {"name": "FACILITY_LONG_NAME", "titles": "FACILITY_LONG_NAME", "datatype": "string", "required": true, "pattern": "^[a-zA-Z\\s]+$"},
              {"name": "ORGANIZATION_TYPE", "titles": "ORGANIZATION_TYPE", "datatype": "string", "required": true, "enum": ["prov","dept","team","govt","ins","pay","edu","reli","crs","cg","bus","other","laboratory","imaging","pharmacy","health-information-network","health-data-aggregator"]},
              {"name": "FACILITY_ADDRESS1", "titles": "FACILITY_ADDRESS1", "datatype": "string", "required": true},
              {"name": "FACILITY_ADDRESS2", "titles": "FACILITY_ADDRESS2", "datatype": "string"},
              {"name": "FACILITY_CITY", "titles": "FACILITY_CITY", "datatype": "string"},
              {"name": "FACILITY_STATE", "titles": "FACILITY_STATE", "datatype": "string", "enum": ["FACILITY_STATE", "NY", "New York"]},
              {"name": "FACILITY_ZIP", "titles": "FACILITY_ZIP", "datatype": "string", "required": true},
              {"name": "VISIT_PART_2_FLAG", "titles": "VISIT_PART_2_FLAG", "datatype": "string", "required": true, "enum": ["Yes", "No"]},
              {"name": "VISIT_OMH_FLAG", "titles": "VISIT_OMH_FLAG", "datatype": "string", "required": true, "enum": ["Yes", "No"]},
              {"name": "VISIT_OPWDD_FLAG", "titles": "VISIT_OPWDD_FLAG", "datatype": "string", "required": true, "enum": ["Yes", "No"]}
            ],
              "primaryKey": ["PAT_MRN_ID"]
          },
          "dialect": {
              "delimiter": "|"
          }
      },
      {
          "url": "data/SCREENING_qcs-test-20240603-testcase4.csv",
          "tableSchema": {
              "columns": [
                  {"name": "PAT_MRN_ID", "titles": "PAT_MRN_ID", "datatype": "string", "required": true},
                  {"name": "FACILITY_ID", "titles": "FACILITY_ID", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_ID", "titles": "ENCOUNTER_ID", "datatype": "string"},
                  {"name": "ENCOUNTER_CLASS_CODE", "titles": "ENCOUNTER_CLASS_CODE", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_CLASS_CODE_DESCRIPTION", "titles": "ENCOUNTER_CLASS_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "ENCOUNTER_CLASS_CODE_SYSTEM", "titles": "ENCOUNTER_CLASS_CODE_SYSTEM", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_STATUS_CODE", "titles": "ENCOUNTER_STATUS_CODE", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_STATUS_CODE_DESCRIPTION", "titles": "ENCOUNTER_STATUS_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "ENCOUNTER_STATUS_CODE_SYSTEM", "titles": "ENCOUNTER_STATUS_CODE_SYSTEM", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_TYPE_CODE", "titles": "ENCOUNTER_TYPE_CODE", "datatype": "string"},
                  {"name": "ENCOUNTER_TYPE_CODE_DESCRIPTION", "titles": "ENCOUNTER_TYPE_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "ENCOUNTER_TYPE_CODE_SYSTEM", "titles": "ENCOUNTER_TYPE_CODE_SYSTEM", "datatype": "string"},
                  {"name": "SCREENING_STATUS_CODE", "titles": "SCREENING_STATUS_CODE", "datatype": "string", "required": true},
                  {"name": "SCREENING_STATUS_CODE_DESCRIPTION", "titles": "SCREENING_STATUS_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "SCREENING_STATUS_CODE_SYSTEM", "titles": "SCREENING_STATUS_CODE_SYSTEM", "datatype": "string", "required": true},
                  {"name": "SCREENING_CODE", "titles": "SCREENING_CODE", "datatype": "string", "required": true},
                  {"name": "SCREENING_CODE_DESCRIPTION", "titles": "SCREENING_CODE_DESCRIPTION", "datatype": "string", "required": true},
                  {"name": "SCREENING_CODE_SYSTEM_NAME", "titles": "SCREENING_CODE_SYSTEM_NAME", "datatype": "string", "required": true},
                  {"name": "RECORDED_TIME", "titles": "RECORDED_TIME", "datatype": "datetime", "required": true},
                  {"name": "QUESTION_CODE", "titles": "QUESTION_CODE", "datatype": "string", "required": true},
                  {"name": "QUESTION_CODE_DESCRIPTION", "titles": "QUESTION_CODE_DESCRIPTION", "datatype": "string", "required": true},
                  {"name": "QUESTION_CODE_SYSTEM_NAME", "titles": "QUESTION_CODE_SYSTEM_NAME", "datatype": "string", "required": true},
                  {"name": "UCUM_UNITS", "titles": "UCUM_UNITS", "datatype": "string"},
                  {"name": "SDOH_DOMAIN", "titles": "SDOH_DOMAIN", "datatype": "string", "required": true},
                  {"name": "PARENT_QUESTION_CODE", "titles": "PARENT_QUESTION_CODE", "datatype": "string"},
                  {"name": "ANSWER_CODE", "titles": "ANSWER_CODE", "datatype": "string", "required": true},
                  {"name": "ANSWER_CODE_DESCRIPTION", "titles": "ANSWER_CODE_DESCRIPTION", "datatype": "string", "required": true},
                  {"name": "ANSWER_CODE_SYSTEM_NAME", "titles": "ANSWER_CODE_SYSTEM_NAME", "datatype": "string", "required": true},
                  {"name": "POTENTIAL_NEED_INDICATED", "titles": "POTENTIAL_NEED_INDICATED", "datatype": "string", "required": true}
              ],
              "foreignKeys": [
                {
                  "columns": ["PAT_MRN_ID"],
                  "resource": "data/QE_ADMIN_DATA_qcs-test-20240603-testcase4.csv",
                    "columnReference": [
                      "PAT_MRN_ID"
                    ]
                }
              ]
          },
          "dialect": {
              "delimiter": "|"
          }
      } 
  ]
}

The py file code

import json
from csvw import TableGroup

datapackage_path = 'datapackage.json'
with open(datapackage_path, 'r') as f:
    datapackage_content = json.load(f)

table_group = TableGroup.from_file(datapackage_path)

def validate_table_group(table_group):
    errors = []

    for table in table_group.tables:
        table_url = table.url
        table_schema = table.tableSchema

        for column in table_schema.columns:
            if not hasattr(column, 'name'):
                errors.append(f"Column without a name in table {table_url}")
            if not hasattr(column, 'datatype'):
                errors.append(f"Column {getattr(column, 'name', '')} without a datatype in table {table_url}")

        if hasattr(table_schema, 'primaryKey') and not isinstance(table_schema.primaryKey, list):
            errors.append(f"Invalid primaryKey in table {table_url}")

        if hasattr(table_schema, 'foreignKeys'):
            for fk in table_schema.foreignKeys:
                if not isinstance(fk.columns, list) or not isinstance(fk.reference.columnReference, list):
                    errors.append(f"Invalid foreignKey specification in table {table_url}")

    return errors

validation_errors = validate_table_group(table_group)

if validation_errors:
    for error in validation_errors:
        print(f"Validation Error: {error}")
else:
    print("No validation errors found.")

print(table_group)
xrotwang commented 3 months ago

Your foreignKeys property seems to be ill-specified. It should look like

"foreignKeys": [{
        "columnReference": ["PAT_MRN_ID"],
        "reference": {
          "resource": "QE_ADMIN_DATA_qcs-test-20240603-testcase4.csv",
          "columnReference": ["PAT_MRN_ID"]
        }
      }]

The error you see complains about the lack of the nested reference property.

megin1989 commented 3 months ago
{
          "url": "data/SCREENING_qcs-test-20240603-testcase4.csv",
          "tableSchema": {
              "columns": [
                  {"name": "PAT_MRN_ID", "titles": "PAT_MRN_ID", "datatype": "string", "required": true},
                  {"name": "FACILITY_ID", "titles": "FACILITY_ID", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_ID", "titles": "ENCOUNTER_ID", "datatype": "string"},
                  {"name": "ENCOUNTER_CLASS_CODE", "titles": "ENCOUNTER_CLASS_CODE", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_CLASS_CODE_DESCRIPTION", "titles": "ENCOUNTER_CLASS_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "ENCOUNTER_CLASS_CODE_SYSTEM", "titles": "ENCOUNTER_CLASS_CODE_SYSTEM", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_STATUS_CODE", "titles": "ENCOUNTER_STATUS_CODE", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_STATUS_CODE_DESCRIPTION", "titles": "ENCOUNTER_STATUS_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "ENCOUNTER_STATUS_CODE_SYSTEM", "titles": "ENCOUNTER_STATUS_CODE_SYSTEM", "datatype": "string", "required": true},
                  {"name": "ENCOUNTER_TYPE_CODE", "titles": "ENCOUNTER_TYPE_CODE", "datatype": "string"},
                  {"name": "ENCOUNTER_TYPE_CODE_DESCRIPTION", "titles": "ENCOUNTER_TYPE_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "ENCOUNTER_TYPE_CODE_SYSTEM", "titles": "ENCOUNTER_TYPE_CODE_SYSTEM", "datatype": "string"},
                  {"name": "SCREENING_STATUS_CODE", "titles": "SCREENING_STATUS_CODE", "datatype": "string", "required": true},
                  {"name": "SCREENING_STATUS_CODE_DESCRIPTION", "titles": "SCREENING_STATUS_CODE_DESCRIPTION", "datatype": "string"},
                  {"name": "SCREENING_STATUS_CODE_SYSTEM", "titles": "SCREENING_STATUS_CODE_SYSTEM", "datatype": "string", "required": true},
                  {"name": "SCREENING_CODE", "titles": "SCREENING_CODE", "datatype": "string", "required": true},
                  {"name": "SCREENING_CODE_DESCRIPTION", "titles": "SCREENING_CODE_DESCRIPTION", "datatype": "string", "required": true},
                  {"name": "SCREENING_CODE_SYSTEM_NAME", "titles": "SCREENING_CODE_SYSTEM_NAME", "datatype": "string", "required": true},
                  {"name": "RECORDED_TIME", "titles": "RECORDED_TIME", "datatype": "datetime", "required": true},
                  {"name": "QUESTION_CODE", "titles": "QUESTION_CODE", "datatype": "string", "required": true},
                  {"name": "QUESTION_CODE_DESCRIPTION", "titles": "QUESTION_CODE_DESCRIPTION", "datatype": "string", "required": true},
                  {"name": "QUESTION_CODE_SYSTEM_NAME", "titles": "QUESTION_CODE_SYSTEM_NAME", "datatype": "string", "required": true},
                  {"name": "UCUM_UNITS", "titles": "UCUM_UNITS", "datatype": "string"},
                  {"name": "SDOH_DOMAIN", "titles": "SDOH_DOMAIN", "datatype": "string", "required": true},
                  {"name": "PARENT_QUESTION_CODE", "titles": "PARENT_QUESTION_CODE", "datatype": "string"},
                  {"name": "ANSWER_CODE", "titles": "ANSWER_CODE", "datatype": "string", "required": true},
                  {"name": "ANSWER_CODE_DESCRIPTION", "titles": "ANSWER_CODE_DESCRIPTION", "datatype": "string", "required": true},
                  {"name": "ANSWER_CODE_SYSTEM_NAME", "titles": "ANSWER_CODE_SYSTEM_NAME", "datatype": "string", "required": true},
                  {"name": "POTENTIAL_NEED_INDICATED", "titles": "POTENTIAL_NEED_INDICATED", "datatype": "string", "required": true}
              ],
              "foreignKeys": [{
              "columnReference": ["PAT_MRN_ID"],
              "reference": {
                "resource": "QE_ADMIN_DATA_qcs-test-20240603-testcase4.csv",
                "columnReference": ["PAT_MRN_ID"]
              }
            }]
          },
          "dialect": {
              "delimiter": "|"
          }
      }  
  When i update like this i got following errors

  `/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py:426: UserWarning: Invalid property pattern for Column warnings.warn('Invalid property {} for {}'.format(k, type_name))

/home/megin/.local/lib/python3.10/site-packages/csvw/metadata.py:426: UserWarning: Invalid property enum for Column warnings.warn('Invalid property {} for {}'.format(k, type_name)) Traceback (most recent call last): File "/home/megin/workspaces/csv-sql-schema/csv-old/csvw/new.py", line 40, in validation_errors = validate_table_group(table_group) File "/home/megin/workspaces/csv-sql-schema/csv-old/csvw/new.py", line 34, in validate_table_group if not isinstance(fk.columns, list) or not isinstance(fk.reference.columnReference, list): AttributeError: 'ForeignKey' object has no attribute 'columns'`

Please help me to fix.

xrotwang commented 3 months ago

But that error seems to come from your own code:

if not isinstance(fk.columns, list) or not isinstance(fk.reference.columnReference, list):

A ForeignKey object has indeed no columns attribute, but only a columnReference and a reference.

megin1989 commented 3 months ago

Thank you. I fixed the errors and got the result, but the output prints 'No validation errors found,' even though our CSV file has validation errors against the JSON file. Is there another Python code option to validate it?

 import json
from csvw import TableGroup

datapackage_path = 'datapackage.json'
with open(datapackage_path, 'r') as f:
    frictionless_datapackage = json.load(f)

table_group = TableGroup.from_frictionless_datapackage(frictionless_datapackage)

def validate_table_group(table_group):
    errors = []
    for table in table_group.tables:
        table_url = table.url
        table_schema = table.tableSchema

        for column in table_schema.columns:
            if not hasattr(column, 'name'):
                errors.append(f"Column without a name in table {table_url}")
            if not hasattr(column, 'datatype'):
                errors.append(f"Column {getattr(column, 'name', '')} without a datatype in table {table_url}")

        if hasattr(table_schema, 'primaryKey') and not isinstance(table_schema.primaryKey, list):
            errors.append(f"Invalid primaryKey in table {table_url}")

        if hasattr(table_schema, 'foreignKeys'):
            for fk in table_schema.foreignKeys:
                if not isinstance(fk.columnReference, list) or not isinstance(fk.reference.columnReference, list):
                    errors.append(f"Invalid foreignKey specification in table {table_url}")

    return errors

validation_errors = validate_table_group(table_group)

if validation_errors:
    for error in validation_errors:
        print(f"Validation Error: {error}")
else:
    print("No validation errors found.")

output_file = 'validation_results1.json'
with open(output_file, 'w', encoding='utf-8') as json_file:
    json.dump(validation_errors, json_file, indent=4)

I also tried this code with datapackage.jsonld (JSON-LD) and received the correct errors. Below is my Python code. Is this correct?

import csv
import json
import os
import re
from datetime import datetime

metadata_file = 'datapackage.jsonld'
with open(metadata_file, 'r', encoding='utf-8') as f:
    metadata = json.load(f)

tables = metadata['tables']

def validate_csv(csv_file_path, table_schema, delimiter):
    errors = []

    with open(csv_file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file, delimiter=delimiter)
        for row_num, row in enumerate(reader, start=1):
            for column in table_schema['columns']:
                col_name = column['name']
                col_titles = column.get('titles', col_name)
                value = row.get(col_titles)
                if column.get('required') and not value:
                    errors.append(f"Row {row_num}: '{col_titles}' is required but missing.")
                if 'enum' in column and value and value not in column['enum']:
                    errors.append(f"Row {row_num}: '{value}' in '{col_titles}' is not a valid value.")
                if 'pattern' in column and value and not re.match(column['pattern'], value):
                    errors.append(f"Row {row_num}: '{value}' in '{col_titles}' does not match pattern {column['pattern']}.")
                if column['datatype'] == 'datetime':
                    try:
                        datetime.strptime(value, '%Y-%m-%dT%H:%M:%S')
                    except ValueError:
                        errors.append(f"Row {row_num}: '{value}' in '{col_titles}' is not a valid datetime.")
                if column['datatype'] == 'date':
                    try:
                        datetime.strptime(value, '%Y-%m-%d')
                    except ValueError:
                        errors.append(f"Row {row_num}: '{value}' in '{col_titles}' is not a valid date.")

    return errors

csv_base_dir = ''

validation_results = {}

for table in tables:
    url = table['url']
    table_schema = table['tableSchema']
    dialect = table.get('dialect', {})

    delimiter = dialect.get('delimiter', ',')
    csv_file_path = os.path.join(csv_base_dir, url)

    if not os.path.exists(csv_file_path):
        validation_results[csv_file_path] = ["File not found."]
        continue

    errors = validate_csv(csv_file_path, table_schema, delimiter)

    if errors:
        validation_results[csv_file_path] = errors
    else:
        validation_results[csv_file_path] = ["Valid"]

output_file = 'validation_results.json'
with open(output_file, 'w', encoding='utf-8') as json_file:
    json.dump(validation_results, json_file, indent=4)

print("Validation completed.")
xrotwang commented 3 months ago

Well, in your first code snippet you only validate the metadata, not the data - while in the second you go through each row in the actual data file.

If you want to do this for a TableGroup, you can copy the code from CSVW.is_valid: https://github.com/cldf/csvw/blob/90ac48577a659cc435a5bd442a907781317f7547/src/csvw/metadata.py#L1660-L1666

megin1989 commented 3 months ago

Can you please give a sample file link or full code?

xrotwang commented 3 months ago

In your function validate_table_group above, add the code snippet I've given, replacing self and self.tablegroup with table_group.

megin1989 commented 3 months ago

Yes, again I got this message "No validation errors found."

for table in table_group.tables: 
    for _ in table.iterdicts(strict=False): 
        pass 
    if not table.check_primary_key():  # pragma: no cover 
        warnings.warn('Duplicate primary key') 
if not table_group.check_referential_integrity(strict=True): 
    warnings.warn('Referential integrity check failed') 

I also tried this code with datapackage.jsonld (JSON-LD) and received the correct errors.

xrotwang commented 3 months ago

Since I don't have your data available, I cannot really check - or know what the "correct errors" should be. But if datapackage.jsonld works for you, you might just stick to that?