[x] There is a validation error because the source data is invalid in some cases
Analysis
Analysing this:

We validate against the source data on purpose: besides remote data, the source might be local data, and if we do not validate it prior to passing the data to the "derive" processors, we won't be able to get a report.

It's worth mentioning that we pass along all the parameters tabulator takes to process the data. So if we pass `skip_rows: 5` with the intent of skipping the first 5 invalid rows, the same `skip_rows: 5` is passed to the `goodtables.validate` method, and validation then runs against the data with those rows already skipped.

That being said, the expected behavior here is that the validator skips the 5 rows, defines the headers, and performs its checks afterwards. While this appears to happen, it seems that the validator initially loads the whole data (concluding it has 3 columns), then skips the 5 rows and defines 2 columns, but still checks against 3 columns and reports an extra value in the data rows, with empty strings in it (this is a bit odd behavior, but looking at the report it seems to be the case):
```python
'tables': [
    {
        'time': 10.334,
        'valid': False,
        'error-count': 821,
        'row-count': 827,
        'source': 'http://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBEX3.M.XAU.USD.EA.AC.C06&its_csvFormat=en&its_fileFormat=csv&mode=its',
        'headers': ['date', 'price'],
        'scheme': 'http',
        'format': 'csv',
        'encoding': 'utf-8-sig',
        'schema': 'table-schema',
        'errors': [
            {'code': 'extra-value', 'message': 'Row 7 has an extra value in column 3', 'row-number': 7, 'column-number': 3, 'row': ['1950-02', '34.730', '']},
            {'code': 'extra-value', 'message': 'Row 8 has an extra value in column 3', 'row-number': 8, 'column-number': 3, 'row': ['1950-03', '34.730', '']},
            ...
```
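The header-inference issue described above can be sketched with a small self-contained example. The sample rows below are invented to mimic a CSV with a metadata block (one of whose rows has a trailing comma, i.e. three fields) followed by a two-column table; `inferred_width` is a hypothetical helper, not part of tabulator or goodtables:

```python
import csv
import io

# Invented sample mimicking a source CSV: five metadata rows
# (the first has a trailing comma, so it parses as 3 fields),
# then the real two-column table.
raw = """series,BBEX3.M.XAU.USD.EA.AC.C06,
Gold price in London,
unit,USD
last update,2018-01-02
,,
date,price
1950-02,34.730
1950-03,34.730
"""

def inferred_width(text, skip_rows=0):
    """Infer the column count from the first row left after skipping."""
    rows = list(csv.reader(io.StringIO(text)))
    return len(rows[skip_rows])

# Without skipping, the width is inferred from the metadata block;
# with skip_rows=5 it is inferred from the actual header row.
print(inferred_width(raw))               # 3 - the metadata row wins
print(inferred_width(raw, skip_rows=5))  # 2 - the real table
```

If the validator pins the table width before applying `skip_rows`, every two-column data row padded to three fields would trigger an `extra-value` check, which matches the report above.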
Originally reported in https://github.com/datahq/datahub-qa/issues/238
Some of the original data cannot pass validation even after passing the parameter, e.g. https://datahub.io/core/gold-prices/v/18
Acceptance Criteria
As a solution, I modified `flow.yaml` for gold prices so that the validator ignores that exact check for the original data, by passing the `skip_check` parameter. See the commit https://github.com/datasets/gold-prices/commit/d1c65279059202c0dcdffb7f33841c056c0412bf for the changes.
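The effect of skipping a check can be illustrated by filtering a goodtables-style report dict like the excerpt above. This is a hypothetical sketch of what the workaround amounts to (`skip_check` here is an invented helper, not the actual parameter handling in the pipeline):

```python
# Abbreviated report mirroring the excerpt shown earlier.
report = {
    'valid': False,
    'error-count': 2,
    'errors': [
        {'code': 'extra-value', 'message': 'Row 7 has an extra value in column 3', 'row-number': 7},
        {'code': 'extra-value', 'message': 'Row 8 has an extra value in column 3', 'row-number': 8},
    ],
}

def skip_check(report, code):
    """Return a copy of the report with all errors of the given code removed."""
    errors = [e for e in report['errors'] if e['code'] != code]
    return {**report,
            'errors': errors,
            'error-count': len(errors),
            'valid': not errors}

clean = skip_check(report, 'extra-value')
print(clean['valid'], clean['error-count'])  # True 0
```

This silences the symptom for the known-bad rows, but the underlying header-inference behavior would still need a fix upstream.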