danfowler closed 7 years ago
@danfowler great input.
Some things here, I think, belong in a custom processor rather than in a declarative quality spec. For example, whether integers should be incrementing is a really hard judgment to pass on a file without any context.
But these are great to track. Part of the idea of pluggable processors in goodtables, and now tabulator, is precisely to implement such checks.
ref.:
With the current goodtables.next design, it will be achievable like this (introducing your custom spec check):
import goodtables

def check_order(fields, state):
    # yield errors like {'field-number': 3, 'message': 'Bad order in a field'}
    yield from ()  # placeholder so the function is valid; it reports no errors

inspector = goodtables.Inspector(checks={
    'check-order': {
        'func': check_order,
        'after': 'ragged-row',
        'type': 'structure',
        'context': 'body',
    },
})
So it will be enough for the spec to contain only the core standardized errors.
@roll I can't wait to start testing out the goodtables refactor!!!
So the API for this:
It's a provisional API excluded from SemVer. If you use it as part of another program, please pin a concrete goodtables version in your requirements file.
To register a custom check, use the check(error) decorator. This way a builtin check can be overridden (pass an error code like duplicate-row instead of a dictionary as in the example below), or a check for a custom error can be added (use the after/before argument to set the insertion position):
from goodtables import Inspector, check

error = {
    'code': 'custom-error',
    'type': 'structure',
    'context': 'body',
}

@check(error, after='blank-row')
def custom_check(errors, columns, row_number, state=None):
    # Iterate over a copy, because columns is modified inside the loop
    for column in list(columns):
        errors.append({
            'code': 'custom-error',
            'message': 'Custom error',
            'row-number': row_number,
            'column-number': column['number'],
        })
        # Remove the reported column from the list passed in by the inspector
        columns.remove(column)

inspector = Inspector(custom_checks=[custom_check])
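To illustrate how the state argument could support the incrementing-integers check discussed earlier in this thread, here is a minimal sketch in plain Python. It does not use goodtables at all: run_checks is a hypothetical stand-in for the inspector's row loop, and the column dictionaries with 'number' and 'value' keys, as well as state being a dict that persists between rows, are assumptions about what a check receives.

```python
def increasing_check(errors, columns, row_number, state=None):
    """Report integer values that are not greater than the previous row's value."""
    if state is None:
        state = {}  # without a shared state nothing is remembered between rows
    # Assumption: `state` is a dict that the caller keeps between rows.
    previous = state.setdefault('previous', {})
    for column in columns:
        number, value = column['number'], column['value']
        if not isinstance(value, int):
            continue  # only integer values are checked in this sketch
        if number in previous and value <= previous[number]:
            errors.append({
                'code': 'custom-error',
                'message': 'Value is not increasing',
                'row-number': row_number,
                'column-number': number,
            })
        previous[number] = value


def run_checks(rows, checks):
    """Hypothetical stand-in for the inspector loop: call every check per row."""
    errors = []
    states = [{} for _ in checks]  # one persistent state dict per check
    for row_number, values in enumerate(rows, start=1):
        for check_func, state in zip(checks, states):
            # Fresh column dicts per check, so mutation does not leak across checks
            columns = [{'number': i, 'value': v}
                       for i, v in enumerate(values, start=1)]
            check_func(errors, columns, row_number, state=state)
    return errors


errors = run_checks([[1, 'a'], [2, 'b'], [2, 'c'], [5, 'd']], [increasing_check])
```

Here the third row repeats the value 2 in the first column, so it is the only row reported. Whether a real file should be held to this rule still depends on context, which is why it fits a custom check better than the core spec.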
Please re-open if other actions or tracking are needed here.
There might be other kinds of data quality issues that could be specified. For instance, numbers in a column should always be increasing, or numbers in a column are "suspicious" (they match the highest possible value for a given type). But maybe we want to keep these kinds of "warning" issues out of scope.
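As a rough sketch of the "suspicious" value idea, the function below flags integers that equal the maximum of a common fixed-width type. The function name, the chosen set of limits, and the shape of the warning dictionaries are illustrative assumptions, not part of any spec.

```python
# Maximum values of common fixed-width integer types (assumed to be "suspicious").
SUSPICIOUS_MAXIMA = {
    2**15 - 1: 'int16',
    2**16 - 1: 'uint16',
    2**31 - 1: 'int32',
    2**32 - 1: 'uint32',
    2**63 - 1: 'int64',
}


def find_suspicious_values(values):
    """Yield a warning for every value that equals a known type maximum."""
    for row_number, value in enumerate(values, start=1):
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, int) and not isinstance(value, bool):
            type_name = SUSPICIOUS_MAXIMA.get(value)
            if type_name is not None:
                yield {
                    'row-number': row_number,
                    'message': 'Value equals the maximum of %s' % type_name,
                }


found = list(find_suspicious_values([10, 2147483647, 42, 65535]))
```

In this sample column, rows 2 and 4 are flagged. A match does not prove the value is wrong, which is the argument for treating such findings as warnings rather than errors.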
Some good sources: