Closed pwalsh closed 8 years ago
Great news @pwalsh - has the potential to take the local census up a notch and start a competition on a broader basis with a stronger quality focus. Watching with interest. Let me know if I can help.
@Stephen-Gates great. just follow this issue. BTW, some of these data quality ideas will also land straight in the census code this year too :).
@jobarratt please add to the list any other examples (Northern Ireland?)
Have added NI because it's a definite, but I will add others. Would be nice to use either http://www.landcareresearch.co.nz/home or http://iatiregistry.org/ so we have a non-government example.
@georgiana-b
List all the errors you are encountering here, now, before investigating them.
Errors encountered for https://data.qld.gov.au/ on GoodTables (0.7.4):
I found workarounds for most of them except the last, but I put them all here in chronological order in case my workarounds caused another following error to happen.
UnboundLocalError: local variable 'headers' referenced before assignment
- for file http://tmr.qld.gov.au/~/media/aboutus/corpinfo/Open%20data/Newbusinessregistrationtransactions/newbusinessregistrationtransactionsbyservicelocation.csv
Traceback:
Traceback (most recent call last):
File "/home/g/.virtualenvs/uk-spend/bin/dq", line 9, in <module>
load_entry_point('data-quality==0.1.1', 'console_scripts', 'dq')()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/data_quality/main.py", line 55, in run
batch.run()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 144, in run
encoding=data['encoding'])
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 133, in pipeline_factory
**options)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 115, in __init__
header_index=self.header_index)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 47, in __init__
self.headers, self.values = self.extract(self.passed_headers)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 63, in extract
headers = headers or self.get_headers(self.stream, reader)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 213, in get_headers
return headers
ValueError: year is out of range
for https://www.health.qld.gov.au/opendata/docs/qas/qas-incidents-lasn-apr2015-mar2016.xlsx
Traceback:
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 142, in run_pipeline_instance
encoding=data_row['encoding'])
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 134, in pipeline_factory
**options)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 115, in __init__
header_index=self.header_index)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 44, in __init__
self.stream = self.to_textstream(self.data_source)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 99, in to_textstream
data_source = format_handler(data_source)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 187, in excel_data_source
).isoformat()
ValueError: year is out of range
http.client.IncompleteRead: IncompleteRead(898623 bytes read, 2230 more expected)
for http://www.dnrm.qld.gov.au/__data/assets/excel_doc/0011/187490/announced-entitlements.xlsx or http://ehp.qld.gov.au/data-sets/emup_201301010000-201312312350_corrected_masterdata.csv
Traceback:
File "/home/g/.virtualenvs/uk-spend/bin/dq", line 9, in <module>
load_entry_point('data-quality==0.1.1', 'console_scripts', 'dq')()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/data_quality/main.py", line 55, in run
batch.run()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 154, in run
self.reports = pool.map(self.run_pipeline_instance, self.dataset, chunksize=7)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 599, in get
raise self._value
http.client.IncompleteRead: IncompleteRead(898623 bytes read, 2230 more expected)
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.communities.qld.gov.au/resources/open-data/companion-card.xlsx'
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/multiprocess/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 142, in run_pipeline_instance
encoding=data_row['encoding'])
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 134, in pipeline_factory
**options)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 115, in __init__
header_index=self.header_index)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 44, in __init__
self.stream = self.to_textstream(self.data_source)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 99, in to_textstream
data_source = format_handler(data_source)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 166, in excel_data_source
workbook = xlrd.open_workbook(data_source)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/xlrd/__init__.py", line 395, in open_workbook
with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.communities.qld.gov.au/resources/open-data/companion-card.xlsx'
""
Workaround: not yet found.
@georgiana-b
FileNotFoundError: [Errno 2] No such file or directory: 'http://www.communities.qld.gov.au/resources/open-data/companion-card.xlsx'
Are you expecting a local path to have such a name?
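The traceback ends in `open(filename, "rb")` inside `xlrd.open_workbook`, so the URL string is being treated as a local path. A minimal sketch of a guard, assuming a hypothetical helper `is_remote` (it is not part of goodtables) that tells the two cases apart before the file is opened:

```python
from urllib.parse import urlparse


def is_remote(source):
    """Return True if `source` looks like an HTTP(S) URL rather than a local path."""
    scheme = urlparse(source).scheme.lower()
    return scheme in ("http", "https")


url = "http://www.communities.qld.gov.au/resources/open-data/companion-card.xlsx"
print(is_remote(url))                         # remote: download first
print(is_remote("/tmp/companion-card.xlsx"))  # local: safe to open directly
```

For a remote source, the downloaded bytes could then be handed to `xlrd.open_workbook(file_contents=...)` instead of a filename.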
@georgiana-b
http.client.IncompleteRead: IncompleteRead(898623 bytes read, 2230 more expected)
Is this consistently reproducible on particular files? If yes, then I guess it is a problem with the server itself, so just handle the error, as you do in your workaround, and maybe consider it a 404 in terms of scoring etc.
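A rough sketch of that handling, not the actual workaround (which isn't shown in this thread). The function name `fetch_or_none` and the `opener` parameter are hypothetical; the idea is only that a truncated download is reported the same way as a missing file:

```python
import http.client
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=30, opener=urllib.request.urlopen):
    """Return the response body, or None if the download fails or is cut short."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (http.client.IncompleteRead, urllib.error.URLError, OSError):
        # Assumption: a truncated or failed download is scored like a 404.
        return None
```

The caller can then treat `None` as "resource unavailable" when computing the score, without the whole batch run aborting.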
@georgiana-b
ValueError: year is out of range
The error is self-explanatory. It seems the workaround is reasonable at a glance.
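For illustration only, here is one way a date conversion can be made tolerant of out-of-range values. This is a sketch under assumptions, not the goodtables code or the workaround discussed here: `excel_serial_to_iso` is a hypothetical helper, and it assumes the cell holds an Excel serial date in the 1900 date system.

```python
from datetime import datetime, timedelta

EXCEL_EPOCH = datetime(1899, 12, 30)  # day zero of the 1900 date system


def excel_serial_to_iso(serial):
    """Convert an Excel serial date to ISO 8601, or return it unchanged if invalid."""
    try:
        return (EXCEL_EPOCH + timedelta(days=serial)).isoformat()
    except (OverflowError, ValueError):
        # The value does not map to a representable date; keep the raw cell value.
        return serial


print(excel_serial_to_iso(43831))
print(excel_serial_to_iso(10**9))
```

Returning the raw value keeps the row in the data so that later checks can still flag it, rather than failing the whole file.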
@georgiana-b
UnboundLocalError: local variable 'headers' referenced before assignment
I'm not happy with the workaround, unless you have clearly identified why you get this problem, which is definitely related to this line: https://github.com/georgiana-b/goodtables/blob/feature/parallel_batch_processing/goodtables/datatable/datatable.py#L208
https://data.gov.au/ has some non-standard resources that can't be handled by CkanGenerator. Ex: https://gist.github.com/georgiana-b/485135b55c8309c1199aa990a05e13d1 lacks publisher_id.
Similar issue encountered for http://iatiregistry.org/. Ex: https://gist.github.com/georgiana-b/d6da60bf5e6b0680711a45d254a37fc3
Progress status:
created; dropped in favor of http://www.landcareresearch.co.nz
@georgiana-b we only need created for timeliness, and we wouldn't be measuring timeliness for IATI (or any of the above except uk-25k) anyway, right?
Right. However, we also aggregate performance based on it, and display based on it in the dashboard, so we decided some time ago to remove support for sources with an empty created_at. If it's really important to have an IATI database, I can make a custom generator for it.
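As a sketch of what dropping such sources could look like: the field names `publisher_id` and `created_at` come from this thread, while the helper `split_sources` and the dict-per-row shape are assumptions, not the generator's real code.

```python
REQUIRED_FIELDS = ("publisher_id", "created_at")


def split_sources(rows, required=REQUIRED_FIELDS):
    """Partition source rows into (usable, rejected) by required-field presence."""
    usable, rejected = [], []
    for row in rows:
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [name for name in required if not row.get(name)]
        (rejected if missing else usable).append(row)
    return usable, rejected


rows = [
    {"id": "a", "publisher_id": "tmr", "created_at": "2016-04-01"},
    {"id": "b", "publisher_id": "", "created_at": "2016-04-01"},
    {"id": "c", "publisher_id": "ehp"},
]
usable, rejected = split_sources(rows)
print([row["id"] for row in usable], [row["id"] for row in rejected])
```

Keeping the rejected rows around, rather than silently discarding them, makes it easier to report which resources a portal publishes in a non-standard shape.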
@georgiana-b ok, so forget IATI for now
@georgiana-b @jobarratt what's the status with the unticked sites?
On Australian data, I keep getting this error:
Traceback (most recent call last):
File "/home/g/.virtualenvs/uk-spend/bin/dq", line 9, in <module>
load_entry_point('data-quality==0.1.1', 'console_scripts', 'dq')()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 716, in __call__
return self.main(*args, **kwargs)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 1060, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 889, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/data_quality/main.py", line 55, in run
batch.run()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 154, in run
self.reports = pool.map(self.run_pipeline_instance, self.dataset, chunksize=7)
File "/usr/lib64/python3.4/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib64/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
File "/usr/lib64/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib64/python3.4/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/batch.py", line 143, in run_pipeline_instance
result, report = pipeline.run()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/pipeline/pipeline.py", line 272, in run
self.data.replay()
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 58, in replay
raise e
File "/home/g/.virtualenvs/uk-spend/lib/python3.4/site-packages/goodtables/datatable/datatable.py", line 52, in replay
if self.stream.seekable():
ValueError: I/O operation on closed file.
However, I haven't managed to find the cause for it or a pattern, so I'm still investigating.
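The last frame fails on `self.stream.seekable()`, and Python's `io` streams raise exactly this `ValueError` when `seekable()` is called after `close()`. The sketch below only shows how to detect that state up front; `can_replay` is a hypothetical helper, and it does not explain why the stream was closed in the first place.

```python
import io


def can_replay(stream):
    """Return True only if the stream is still open and supports seeking."""
    return not stream.closed and stream.seekable()


stream = io.StringIO("id,name\n1,a\n")
print(can_replay(stream))  # open, seekable stream
stream.close()
print(can_replay(stream))  # closed stream: no ValueError, just False
```

Logging the source URL whenever `can_replay` returns False could help narrow down which files trigger the error.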
@georgiana-b for the last bug: find a file it happens on, and raise an issue on goodtables. I'm closing this issue now, and please post info on the deployed instances to our slack.
Description
With the new CKAN database generator in the CLI, we can read any public CKAN instance and generate a publishers.csv and sources.csv with all the tabular data in that instance, and then run data quality assessments on these instances (hello @Stephen-Gates!). So, let's set up a number of them in free Heroku dynos and start to look at the results.
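For orientation, public CKAN instances expose their datasets through the Action API, where `package_search` accepts `start` and `rows` for paging. The snippet below is a minimal sketch of building such a request URL; `package_search_url` is a hypothetical helper, and whether the CLI generator uses this exact endpoint is an assumption.

```python
from urllib.parse import urlencode, urljoin


def package_search_url(base_url, start=0, rows=100):
    """Build a CKAN Action API URL for one page of dataset search results."""
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/3/action/package_search")
    return endpoint + "?" + urlencode({"start": start, "rows": rows})


print(package_search_url("https://data.qld.gov.au/"))
```

Each dataset in the response lists its resources, which is where the tabular files for sources.csv would come from.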
Tasks