frictionlessdata / data-quality-spec

A spec for reporting errors in data quality.
MIT License

Other types of quality issues #4

Closed danfowler closed 7 years ago

danfowler commented 8 years ago

There may be other kinds of data quality issues that could be specified. For instance: numbers in a column should always be increasing, or numbers in a column are "suspicious" (they match the highest possible value for a given type). But maybe we want to keep these kinds of "warning" issues out of scope.
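To make the two examples above concrete, here is a minimal plain-Python sketch of what such checks could look like. It does not use goodtables or the spec; the function names and the `SUSPICIOUS_MAXIMA` set are hypothetical, chosen only for illustration.

```python
# Illustrative limits only: the largest values of common fixed-width
# signed integer types (16-, 32- and 64-bit). This set is an assumption.
SUSPICIOUS_MAXIMA = {2**15 - 1, 2**31 - 1, 2**63 - 1}


def find_order_violations(values):
    """Return 1-based positions of values that are not greater than the previous one."""
    violations = []
    previous = None
    for position, value in enumerate(values, start=1):
        if previous is not None and value <= previous:
            violations.append(position)
        previous = value
    return violations


def find_suspicious_values(values, maxima=SUSPICIOUS_MAXIMA):
    """Return 1-based positions of values equal to a known type maximum."""
    return [pos for pos, value in enumerate(values, start=1) if value in maxima]


ids = [1, 2, 5, 4, 2147483647]
print(find_order_violations(ids))   # [4]
print(find_suspicious_values(ids))  # [5]
```

Both functions only flag positions. Whether a flagged value is a real error depends on context, which is why these read more like warnings than core spec errors.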

Some good sources:

pwalsh commented 8 years ago

@danfowler great input.

Some things here, I think, belong in a custom processor rather than in a declarative quality spec. For example, whether integers should be incrementing is a really hard judgment to pass on a file without any context.

But these are great to track. Part of the idea of pluggable processors in goodtables, and now tabulator, is precisely to implement such checks.

ref.:

roll commented 8 years ago

With the current goodtables.next design, this will be achievable by introducing your own custom check, like this:

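import goodtables

def check_order(fields, state):
    # Yield errors like {'field-number': 3, 'message': 'Bad order in a field'}
    yield from ()  # placeholder: this stub reports no errors

# Register the check so it runs after the built-in 'ragged-row' check
inspector = goodtables.Inspector(checks={
    'check-order': {
        'func': check_order,
        'after': 'ragged-row',
        'type': 'structure',
        'context': 'body',
    },
})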

So it will be enough for the spec to contain only the core standardized errors.

pwalsh commented 8 years ago

@roll I can't wait to start testing out the goodtables refactor!!!

roll commented 7 years ago

So the API for this is:


Custom checks

This is a provisional API excluded from SemVer. If you use it as part of another program, please pin a concrete goodtables version in your requirements file.

To register a custom check, use the check(error) decorator. This way a built-in check can be overridden (pass an error code such as duplicate-row instead of a dictionary as in the example below), or a check for a custom error can be added (use the after/before argument to set the insertion position):

from goodtables import Inspector, check

# Definition of the custom error reported by the check below
error = {
    'code': 'custom-error',
    'type': 'structure',
    'context': 'body',
}

@check(error, after='blank-row')
def custom_check(errors, columns, row_number, state=None):
    # Iterate over a copy: removing items from a list while iterating
    # over that same list would skip every other column.
    for column in list(columns):
        errors.append({
            'code': 'custom-error',
            'message': 'Custom error',
            'row-number': row_number,
            'column-number': column['number'],
        })
        columns.remove(column)

inspector = Inspector(custom_checks=[custom_check])
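As a rough illustration of how the "numbers should be increasing" case from the start of this thread might fit that check signature, here is a sketch that runs without goodtables. Several things are assumptions rather than the library's API: the decorator is omitted so the function can be called directly, each column dict is assumed to carry a 'value' key next to 'number', the same `state` dict is assumed to be passed on every call, and the `column_number` parameter is hypothetical.

```python
def check_increasing(errors, columns, row_number, state=None, column_number=1):
    """Append an error when the watched column does not increase between rows."""
    # Assumption: the caller passes the same dict on every call.
    state = {} if state is None else state
    for column in columns:
        if column['number'] != column_number:
            continue
        previous = state.get('previous')
        value = column['value']  # assumption: columns expose a 'value' key
        if previous is not None and value <= previous:
            errors.append({
                'code': 'custom-error',
                'message': 'Value is not greater than the previous row',
                'row-number': row_number,
                'column-number': column['number'],
            })
        state['previous'] = value


# Simulate three body rows of a single-column table.
errors, state = [], {}
for row_number, value in enumerate([10, 20, 15], start=1):
    check_increasing(errors, [{'number': 1, 'value': value}], row_number, state=state)

print([e['row-number'] for e in errors])  # [3]
```

Keeping the previous value in `state` rather than in a module-level variable means each inspection run can start from a fresh dict.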

Please re-open if other actions or tracking are needed here.