datahubio / datahub-v2-pm

Project management (issues only)

Source data can not pass validation #218

Closed zelima closed 5 years ago

zelima commented 5 years ago

Originally coming from https://github.com/datahq/datahub-qa/issues/238

Some of the original data cannot pass validation even after the parameters are passed, e.g. https://datahub.io/core/gold-prices/v/18

Acceptance Criteria

Analysis

Analysing this:

We validate the source data on purpose: besides remote data, the source might be local data, and if we do not validate it before passing it to the "derive" processors, we won't be able to get a report.

It is worth mentioning that we pass along all the parameters that tabulator takes to process the data. This means that if we pass skip_rows: 5, intending to skip the first 5 invalid rows, the same skip_rows: 5 is passed to the goodtables.validate method, and validation runs against the data with those 5 rows skipped, not considering them.

That being said, the expected behavior here is that the validator skips 5 rows, defines the headers, and performs the checks afterwards. While it does do this, it seems that the validator initially loads the whole data (thinking it has 3 columns), then skips the 5 rows and defines 2 columns, yet still checks against 3 columns and thinks there is an extra value in rows that end with an empty string. (This is slightly odd behavior, but judging by the report it seems to be the case.)

'tables': [
  {
    'time': 10.334,
    'valid': False,
    'error-count': 821,
    'row-count': 827,
    'source': 'http://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBEX3.M.XAU.USD.EA.AC.C06&its_csvFormat=en&its_fileFormat=csv&mode=its',
    'headers': ['date', 'price'],
    'scheme': 'http',
    'format': 'csv',
    'encoding': 'utf-8-sig',
    'schema': 'table-schema',
    'errors': [
      {'code': 'extra-value', 'message': 'Row 7 has an extra value in column 3', 'row-number': 7, 'column-number': 3, 'row': ['1950-02', '34.730', '']},
      {'code': 'extra-value', 'message': 'Row 8 has an extra value in column 3', 'row-number': 8, 'column-number': 3, 'row': ['1950-03', '34.730', '']}
    ...
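
The errors in the report are what one would expect if every data row carries a third, empty cell while the header row defines only two columns. Below is a minimal stdlib sketch of that situation. It is not how goodtables or tabulator are implemented: the sample text, the trailing comma as the cause of the empty cell, and the helper name find_extra_values are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample: 5 preamble lines, a 2-column header, then data rows
# ending in a trailing comma, so each data row parses as 3 cells.
SAMPLE = (
    "preamble 1\n"
    "preamble 2\n"
    "preamble 3\n"
    "preamble 4\n"
    "preamble 5\n"
    "date,price\n"
    "1950-02,34.730,\n"
    "1950-03,34.730,\n"
)


def find_extra_values(text, skip_rows):
    """Skip the first `skip_rows` rows, take the next row as headers,
    and report every cell that lies beyond the header width."""
    rows = list(csv.reader(io.StringIO(text)))
    headers = rows[skip_rows]
    errors = []
    # Row numbers are 1-based and still count the skipped rows.
    for row_number, row in enumerate(rows[skip_rows + 1:], start=skip_rows + 2):
        for column_number in range(len(headers) + 1, len(row) + 1):
            errors.append({
                "code": "extra-value",
                "row-number": row_number,
                "column-number": column_number,
                "row": row,
            })
    return headers, errors


headers, errors = find_extra_values(SAMPLE, skip_rows=5)
for error in errors:
    print(error)
```

With this sample, the first data row is reported as row 7, column 3, which matches the shape of the first error in the report.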

As a solution, I modified flow.yaml for gold prices so that the validator ignores that exact check for the original data, by passing the skip_check parameter. See the commit https://github.com/datasets/gold-prices/commit/d1c65279059202c0dcdffb7f33841c056c0412bf for the changes.
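
The effect of skipping one check can be illustrated without the real validator. The sketch below is a simplified stand-in, not goodtables itself: run_checks, CHECKS, and the two check functions are hypothetical names, and the second check is included only to show that the remaining checks still run. Only the extra-value code and the sample rows are taken from the report above.

```python
def check_extra_value(headers, row_number, row):
    """Flag cells that fall beyond the last header column."""
    return [
        {"code": "extra-value", "row-number": row_number, "column-number": col}
        for col in range(len(headers) + 1, len(row) + 1)
    ]


def check_missing_value(headers, row_number, row):
    """Flag header columns for which the row has no cell."""
    return [
        {"code": "missing-value", "row-number": row_number, "column-number": col}
        for col in range(len(row) + 1, len(headers) + 1)
    ]


# Hypothetical registry of checks, keyed by error code.
CHECKS = {
    "extra-value": check_extra_value,
    "missing-value": check_missing_value,
}


def run_checks(headers, rows, skip_checks=()):
    """Run every registered check except those named in `skip_checks`."""
    errors = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        for code, check in CHECKS.items():
            if code in skip_checks:
                continue
            errors.extend(check(headers, row_number, row))
    return {"valid": not errors, "error-count": len(errors), "errors": errors}


headers = ["date", "price"]
rows = [["1950-02", "34.730", ""], ["1950-03", "34.730", ""]]

strict = run_checks(headers, rows)
relaxed = run_checks(headers, rows, skip_checks=["extra-value"])
print(strict["valid"], strict["error-count"])
print(relaxed["valid"], relaxed["error-count"])
```

The trade-off of this workaround is that a skipped check also stops catching rows that really do contain unexpected extra values.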

zelima commented 5 years ago

FIXED