frictionlessdata / data-quality-dashboard

Data Quality Dashboards display statistics on a collection of published data.

Deploy a number of example instances against CKAN instances #68

Closed pwalsh closed 8 years ago

pwalsh commented 8 years ago

Description

With the new CKAN database generator in the CLI, we can read any public CKAN instance and generate a publishers.csv and sources.csv with all the tabular data in that instance, and then run data quality assessments on these instances ( hello @Stephen-Gates ! ).

So, let's set up a number of them on free Heroku dynos and start to look at the results.
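A minimal sketch of what that generation step amounts to, not the CLI's actual code: it takes one package dict in the shape returned by CKAN's `package_search` action and keeps only the tabular resources. The column names (other than `publisher_id` and `created_at`, which come up later in this thread), the `TABULAR_FORMATS` set, and both function names are assumptions for illustration.

```python
import csv
import io

# Assumption: which resource formats count as "tabular" for this sketch.
TABULAR_FORMATS = {"csv", "xls", "xlsx"}

# Assumption: illustrative column names for a sources.csv row.
SOURCE_FIELDS = ["id", "publisher_id", "title", "data", "format", "created_at"]


def package_to_source_rows(package):
    """Turn one CKAN package dict into a list of source rows."""
    organization = package.get("organization") or {}
    rows = []
    for resource in package.get("resources", []):
        fmt = (resource.get("format") or "").strip().lower()
        if fmt not in TABULAR_FORMATS:
            continue  # skip PDFs, HTML pages, and other non-tabular files
        rows.append({
            "id": resource.get("id", ""),
            "publisher_id": organization.get("name", ""),
            "title": resource.get("name") or package.get("title", ""),
            "data": resource.get("url", ""),
            "format": fmt,
            "created_at": resource.get("created", ""),
        })
    return rows


def rows_to_csv(rows):
    """Serialise source rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SOURCE_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the conversion separate from the HTTP call means the mapping can be checked against a saved API response before pointing it at a live instance.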

Tasks

Stephen-Gates commented 8 years ago

Great news @pwalsh - has the potential to take the local census up a notch and start a competition on a broader basis with a stronger quality focus. Watching with interest. Let me know if I can help.

pwalsh commented 8 years ago

@Stephen-Gates great. just follow this issue. BTW, some of these data quality ideas will also land straight in the census code this year too :).

pwalsh commented 8 years ago

@jobarratt please add to the list any other examples (Northern Ireland?)

jobarratt commented 8 years ago

Have added NI because it's a definite, but I will add others. Would be nice to use either http://www.landcareresearch.co.nz/home or http://iatiregistry.org/ so we have a non-government example.

pwalsh commented 8 years ago

@georgiana-b

List all the errors you are encountering here, now, before investigating them.

georgiana-b commented 8 years ago

Errors encountered for https://data.qld.gov.au/ on GoodTables (0.7.4):

I found workarounds for all of them except the last, but I'm listing them all here in chronological order in case one of my workarounds caused a later error.

Traceback (most recent call last):
  File "/home/g/.virtualenvs/uk-spend/bin/dq", line 9, in <module>
    load_entry_point('data-quality==0.1.1', 'console_scripts', 'dq')()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/data_quality/main.py", line 55, in run
    batch.run()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 144, in run
    encoding=data['encoding'])
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 133, in pipeline_factory
    **options)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 115, in __init__
    header_index=self.header_index)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 47, in __init__
    self.headers, self.values = self.extract(self.passed_headers)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 63, in extract
    headers = headers or self.get_headers(self.stream, reader)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 213, in get_headers
    return headers
UnboundLocalError: local variable 'headers' referenced before assignment

Workaround.

  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 142, in run_pipeline_instance
    encoding=data_row['encoding'])
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 134, in pipeline_factory
    **options)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 115, in __init__
    header_index=self.header_index)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 44, in __init__
    self.stream = self.to_textstream(self.data_source)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 99, in to_textstream
    data_source = format_handler(data_source)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 187, in excel_data_source
    ).isoformat()
ValueError: year is out of range

Workaround.

  File "/home/g/.virtualenvs/uk-spend/bin/dq", line 9, in <module>
    load_entry_point('data-quality==0.1.1', 'console_scripts', 'dq')()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/data_quality/main.py", line 55, in run
    batch.run()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 154, in run
    self.reports = pool.map(self.run_pipeline_instance, self.dataset, chunksize=7)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 599, in get
    raise self._value
http.client.IncompleteRead: IncompleteRead(898623 bytes read, 2230 more expected)

Workaround.

  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 142, in run_pipeline_instance
    encoding=data_row['encoding'])
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 134, in pipeline_factory
    **options)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 115, in __init__
    header_index=self.header_index)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 44, in __init__
    self.stream = self.to_textstream(self.data_source)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 99, in to_textstream
    data_source = format_handler(data_source)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 166, in excel_data_source
    workbook = xlrd.open_workbook(data_source)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/xlrd/__init__.py", line 395, in open_workbook
    with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.communities.qld.gov.au/resources/open-data/companion-card.xlsx'

Workaround: not yet found.

pwalsh commented 8 years ago

@georgiana-b

FileNotFoundError: [Errno 2] No such file or directory: 'http://www.communities.qld.gov.au/resources/open-data/companion-card.xlsx'

Are you expecting a local path to have such a name?
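The traceback shows `xlrd.open_workbook` being handed the URL as if it were a filename. One way around that, sketched here under assumptions rather than taken from goodtables, is to download the bytes first and pass them through xlrd's `file_contents` argument. The helper names `is_remote`, `read_source_bytes`, and `open_excel` are hypothetical.

```python
import urllib.request
from urllib.parse import urlparse


def is_remote(source):
    """True when the source looks like an HTTP(S) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")


def read_source_bytes(source, opener=urllib.request.urlopen):
    """Return the raw bytes of a local file or a remote URL."""
    if is_remote(source):
        with opener(source) as response:
            return response.read()
    with open(source, "rb") as handle:
        return handle.read()


def open_excel(source):
    """Open a workbook from either a path or a URL."""
    import xlrd  # third-party; imported lazily so the helpers above stay stdlib-only

    # file_contents lets xlrd parse in-memory bytes instead of opening a path.
    return xlrd.open_workbook(file_contents=read_source_bytes(source))
```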

pwalsh commented 8 years ago

@georgiana-b

http.client.IncompleteRead: IncompleteRead(898623 bytes read, 2230 more expected)

Is this consistently reproducible on particular files? If yes, then I guess it is a problem with the server itself, so just handle the error, as you do in your workaround, and maybe consider it a 404 in terms of scoring etc.
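A rough sketch of that kind of handling, assuming the download happens through `urllib`: a truncated response is caught and reported the same way as an unreachable file, so the caller can score it like a 404. The function name `fetch_or_none` and the error strings are made up for illustration and are not what the workaround actually does.

```python
import http.client
import urllib.error
import urllib.request


def fetch_or_none(url, opener=urllib.request.urlopen):
    """Return (body, error). body is None when the download failed."""
    try:
        with opener(url) as response:
            return response.read(), None
    except http.client.IncompleteRead as exc:
        # The server closed the connection before sending the promised bytes.
        return None, "incomplete-read: got {} bytes".format(len(exc.partial))
    except urllib.error.URLError as exc:
        # Covers HTTP errors such as 404 as well as DNS and connection failures.
        return None, "unreachable: {}".format(exc.reason)
```

The caller only has to check whether `body` is `None`, which keeps a flaky server from aborting the whole batch.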

pwalsh commented 8 years ago

@georgiana-b

ValueError: year is out of range

The error is self-explanatory. The workaround seems reasonable at a glance.
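One plausible cause, not confirmed in this thread, is a time-only Excel cell: `xlrd.xldate_as_tuple` returns a zero year, month, and day for those, and `datetime.datetime` rejects year 0. A defensive conversion could look like the sketch below; `date_tuple_to_iso` is a hypothetical helper, not the workaround referred to above.

```python
import datetime


def date_tuple_to_iso(parts):
    """Convert a (year, month, day, hour, minute, second) tuple to ISO text.

    Returns a time-only string for time-only cells and None when the
    tuple cannot be represented as a datetime at all.
    """
    year, month, day, hour, minute, second = parts
    if (year, month, day) == (0, 0, 0):
        # Time-only value: there is no calendar date to build.
        return datetime.time(hour, minute, second).isoformat()
    try:
        return datetime.datetime(year, month, day, hour, minute, second).isoformat()
    except ValueError:
        # Out-of-range year, invalid month or day, and similar cases.
        return None
```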

pwalsh commented 8 years ago

@georgiana-b

UnboundLocalError: local variable 'headers' referenced before assignment

I'm not happy with the workaround unless you have clearly identified why you get this problem, which is definitely related to this line: https://github.com/georgiana-b/goodtables/blob/feature/parallel_batch_processing/goodtables/datatable/datatable.py#L208
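For context, this error usually means a variable is assigned only inside a loop or branch that never ran, for example when the stream has fewer rows than the header index. The linked code is not reproduced in this thread, so the following is only a generic sketch of the pattern and an explicit alternative; the function and its signature are assumptions.

```python
def get_headers(rows, header_index=0):
    """Return the row at header_index, or fail with a clear message."""
    headers = None  # bound up front, so the name always exists
    for index, row in enumerate(rows):
        if index == header_index:
            headers = list(row)
            break
    if headers is None:
        # Reached when rows is empty or shorter than header_index + 1.
        raise ValueError("no row at header_index {}".format(header_index))
    return headers
```

Raising a specific error here points at the input file that has no header row, instead of surfacing as an `UnboundLocalError` deep in the pipeline.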

georgiana-b commented 8 years ago

https://data.gov.au/ has some non-standard resources that can't be handled by CkanGenerator. Example: https://gist.github.com/georgiana-b/485135b55c8309c1199aa990a05e13d1 lacks publisher_id. A similar issue was encountered for http://iatiregistry.org/. Example: https://gist.github.com/georgiana-b/d6da60bf5e6b0680711a45d254a37fc3
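One way to keep such records from breaking a run is to set them aside before generating the database and report what was missing. This is a sketch under assumptions, not CkanGenerator's behaviour: the record shape, the `REQUIRED_FIELDS` tuple, and `split_usable` are hypothetical.

```python
# Assumption: the fields a source record needs before it can be assessed.
REQUIRED_FIELDS = ("publisher_id", "created_at")


def split_usable(records, required=REQUIRED_FIELDS):
    """Split records into usable ones and (id, missing_fields) pairs."""
    usable, skipped = [], []
    for record in records:
        # A field counts as missing when absent, None, or an empty string.
        missing = [name for name in required if not record.get(name)]
        if missing:
            skipped.append((record.get("id"), missing))
        else:
            usable.append(record)
    return usable, skipped
```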

georgiana-b commented 8 years ago

Progress status:

pwalsh commented 8 years ago

@georgiana-b we only need created for timeliness, and we wouldn't be measuring timeliness for IATI (or any of the above except uk-25k) anyway, right?

georgiana-b commented 8 years ago

Right. However, we also aggregate performance based on it, and display based on it in the dashboard, so we decided some time ago to remove support for sources with an empty created_at. If it's really important to have an IATI database, I can make a custom generator for it.

pwalsh commented 8 years ago

@georgiana-b ok, so forget IATI for now

pwalsh commented 8 years ago

@georgiana-b @jobarratt what's the status with the unticked sites?

georgiana-b commented 8 years ago

On Australian data, I keep getting this error:

Traceback (most recent call last):
  File "/home/g/.virtualenvs/uk-spend/bin/dq", line 9, in <module>
    load_entry_point('data-quality==0.1.1', 'console_scripts', 'dq')()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/data_quality/main.py", line 55, in run
    batch.run()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 154, in run
    self.reports = pool.map(self.run_pipeline_instance, self.dataset, chunksize=7)
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 599, in get
    raise self._value
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 143, in run_pipeline_instance
    result, report = pipeline.run()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 272, in run
    self.data.replay()
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 58, in replay
    raise e
  File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 52, in replay
    if self.stream.seekable():
ValueError: I/O operation on closed file.

However, I haven't managed to find the cause for it or a pattern, so I'm still investigating.

pwalsh commented 8 years ago

@georgiana-b for the last bug: find a file it happens on, and raise an issue on goodtables. I'm closing this issue now, and please post info on the deployed instances to our slack.
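A simple way to find such a file is to run the sources one at a time, outside the process pool, and record which ones raise. The sketch below assumes a `check` callable that runs a single source through the pipeline; that callable and `find_failing_sources` are hypothetical stand-ins, not goodtables API.

```python
def find_failing_sources(sources, check):
    """Run check(source) for each source and collect the ones that raise ValueError."""
    failures = []
    for source in sources:
        try:
            check(source)
        except ValueError as exc:
            # Keep the message so the goodtables issue can quote it verbatim.
            failures.append((source, str(exc)))
    return failures
```

Running serially trades speed for a traceback that maps to exactly one URL, which is what the bug report needs.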