A Schema
is a specification of a data set, which may contain more than 1 table.
It exists independently of any particular data set, and therefore can be constructed in the absence of a data set.
This package facilitates 3 use cases:
Read/write a schema from/to a TOML file.
Compare a data set to a schema and list the non-compliance issues.
Transform a data set to comply with a schema as much as possible and return a table of outstanding issues.
Although a schema can be specified in code, it is more practical to specify it in a configuration file.
This has the added benefit of sharing it with non-technical data custodians and researchers,
providing a common language for discussing data-related issues.
Indeed the 3 use cases listed above can be carried out without writing any Julia code - just run the compare
script in the scripts
directory.
A TableSchema
looks like this TOML
file:
name = "mytable"
description = "My table"
primarykey = "patientid" # A column name or a vector of column names
columns = [
{name = "patientid", description = "Patient ID", datatype = "UInt", validvalues = "UInt", iscategorical = false, isrequired = true, isunique = true},
{name = "age", description = "Age (years)", datatype = "Int", validvalues = "Int", iscategorical = false, isrequired = true, isunique = false},
{name = "dose", description = "Dose size", datatype = "String", validvalues = ["small", "medium", "large"], iscategorical = true, isrequired = true, isunique = false},
{name = "fever", description = "Had fever", datatype = "Bool", validvalues = "Bool", iscategorical = true, isrequired = true, isunique = false}
]
A Schema
contains 1 or more TableSchema
. For example:
name = "fever"
description = "Fever schema"
[tables]
table1 = "table1_schema"
table2 = "table2_schema"
For tables that fit into memory, usage is as follows:
# Read in a schema
using Schemata
schema = readschema(joinpath(dirname(pathof(Schemata)), "..", "test/schemata/fever.toml"))
ts = schema.tables[:mytable] # TableSchema for mytable
# Construct/import a table (any object that satisfies the Tables.jl interface)
using DataFrames
table = DataFrame(
patientid = [1, 2, 3, 4],
age = [11, 22, 33, 444], # Note that 444 is not in the set of valid values according to the schema
dose = ["small", "medium", "large", "medium"],
fever = [false, true, true, false]
)
# Transform the table to comply with the schema.
# Values that are unparseable or invalid are set to missing.
# Return the transformed data, a table of input data issues and a table of output data issues.
outdata, input_issues, output_issues = compare(ts, table)
For tables that are too big to fit into memory, replace the table argument with the filename of the table:
# Transform the table to comply with the schema.
# Values that are unparseable or invalid are set to missing.
# Write the transformed data, a table of input data issues and a table of output data issues to disk.
input_data_file = "/path/to/mytable.tsv"
output_data_file = "/path/to/transformed_table.tsv"
input_issues_file = "/path/to/input_issues.tsv"
output_issues_file = "/path/to/output_issues.tsv"
compare(ts, input_data_file; output_data_file=output_data_file, input_issues_file=input_issues_file, output_issues_file=output_issues_file)
# Or simply...
compare(ts, input_data_file) # output_data_file, input_issues_file, output_issues_file have default values
If your data is sorted by its primary key, set the keyword argument sorted_by_primarykey
to true
(default is false
).
That is, use compare(ts, table; sorted_by_primarykey=true)
or compare(ts, input_data_file; sorted_by_primarykey=true)
.
In this case the compare
function assumes that your input data table is sorted by its primary key (as specified in the schema),
which allows for a faster comparison between the data and the schema.
Users can provide custom parsers, which enable parsing of values:
Core
module.Core
types in non-standard ways, such as custom date formats.Core
types in standard ways with a unified interface.A user-defined parser can be specified directly or in a Dict
, and therefore can be specified in a configuration file.
You can specify positional and/or keyword arguments as required.
For example, the following code from the test suite defines a user-defined parser for a ZonedDateTime
.
Note the specification of a range of non-Core
types, namely (startvalue, stepsize, endvalue)
.
# Define user-defined parser
using TimeZones
function my_zdt_parser(s::T, tz::String) where {T <: AbstractString}
occursin(':', s) && return ZonedDateTime(DateTime(s[1:16]), TimeZone(tz)) # Example: s="2020-12-31T09:30:59+10:00"
dt = Date(eval(Meta.parse(s))) # Examples: s="today()", s="2020-11-01"
ZonedDateTime(DateTime(dt), TimeZone(tz))
end
my_zdt_parser(dttm::DateTime, tz::String) = ZonedDateTime(dttm, TimeZone(tz))
# Dict for ColumnSchema constructor, obtained after reading toml
d = Dict("name" => "zdt", "description" => "Test custom parser for TimeZones.ZonedDateTime",
"datatype" => "ZonedDateTime",
"iscategorical" => false, "isrequired" => true, "isunique" => true,
"validvalues" => "(today()-Year(2), Hour(1), today()-Day(1))", # Ensure that the range has sufficient resolution
"parser" => Dict("function" => "my_zdt_parser", "args"=>["Australia/Melbourne"]))
# Need to eval datatype and the parser function in the same scope that they were defined (and before constructing the ColumnSchema).
# Schemata.jl can't see the datatype and parser function until it receives them from the current scope.
d["datatype"] = eval(Meta.parse(d["datatype"]))
d["parser"]["function"] = eval(Meta.parse(d["parser"]["function"]))
# Now the schema constructors can be used
cs = ColumnSchema(d)
ts = TableSchema(:mytable, "My table", [cs], [:zdt])
table = DataFrame(zdt=[DateTime(today() - Day(7)) + Hour(i) for i = 1:3])
target = [ZonedDateTime(table[i, :zdt], TimeZone("Australia/Melbourne")) for i = 1:3]
outdata, issues_in, issues_out = compare(ts, table)
outdata[!, :zdt] == target
table = DataFrame(zdt=[string(DateTime(today() - Day(7)) + Hour(i)) for i = 1:3]) # String type
outdata, issues_in, issues_out = compare(ts, table)
outdata[!, :zdt] == target
table = DataFrame(zdt=[string(ZonedDateTime(DateTime(today() - Day(7)) + Hour(i), TimeZone("Australia/Melbourne"))) for i = 1:3]) # String type
outdata, issues_in, issues_out = compare(ts, table)
outdata[!, :zdt] == target
We often want to ensure that certain relationships hold between variables within a row.
For example, we might require that a person's marriage date is after his/her birth date.
We can achieve this by specifying one or more intra-row constraints in a TableSchema
as follows:
name = "intrarow_constraints_demo"
description = "Table with intra-row constraints"
primarykey = "patientid"
intrarow_constraints = {"birth date before marriage date" = "r[:dob] < r[:date_of_marriage]"}
columns = [
{name="patientid", description = "Patient ID", datatype = "UInt", validvalues = "UInt", iscategorical = false, isrequired = true, isunique = true},
{name="dob", description = "Date of birth", datatype = "Date", validvalues = "Date", iscategorical = false, isrequired = true, isunique = false},
{name="date_of_marriage", description = "Date of marriage", datatype = "Date", validvalues = "Date", iscategorical = false, isrequired = false, isunique = false}
]
Each constraint is specified as a key-value pair, where the key is a description of the constraint and
the value is the right-hand side of a function of a row r
.
The function must return true
or false
.
When comparing the schema to a table, the function is executed on each row.
If the function returns false
for one or more rows, the constraint isn't satisfied and its description is recorded in the returned issues table.