Closed tdipisa closed 6 years ago
Could you elaborate more about broken links? Do you mean links to resources that doesn't work, links to remote datasets that doesn't work or datasets/resources reported as broken during harvesting?
I've put together some preliminary working reports: https://github.com/geosolutions-it/ckanext-gsreport/pull/1#issuecomment-363708643
List of reports:
Resource types report:
License report:
Broken links report:
This one contains more information per row, because link can be broken in several ways. I've put some basic information in template, there's a bit more in json/csv payloads:
Reports can be customized with template, also we could add some filtering, to limit items requested (for example, limit report to specific organization or dataset). This could be helpful especially in broken links report.
@cezio
I tested on opendata-trento.geo-solutions.it (ckan 2.5.2) and the installation seems to works fine. Just some notes below:
I think we need to improve the broken links checking for WMS resources. In this dataset for example we have WMS resources that point to http://www.territorio.provincia.tn.it/ that's not a WMS URL.
I would like to have a nice Admin UI for report using cards like this
Please, in the resource format report indicate the total amount of existing resources
Please, check the broken links report. (Here)[http://opendata-trento.geo-solutions.it/report/broken-links] it seems that no broken links exist but (here)[http://opendata-trento.geo-solutions.it/dataset/aree-wifi-di-accesso-pubblico] we have some 'not found' (404) resources
Tested then in provbz demo server(ckan 2.4.0) . Some notes below:
I think we need to improve the broken links checking for WMS resources. In this dataset for example we have WMS resources that point to http://www.territorio.provincia.tn.it/ that's not a WMS URL.
True. This is however a link provided in resource. It will redirect to landing page with url to actual webgis site, when used in browser. Script will report this url as invalid, because in fact, it's not valid WMS endpoint. So, either its type should be changed to other value than WMS (HTML maybe), or url should be changed to valid WMS.
Please, check the broken links report. (Here)[http://opendata-trento.geo-solutions.it/report/broken-links] it seems that no broken links exist but (here)[http://opendata-trento.geo-solutions.it/dataset/aree-wifi-di-accesso-pubblico] we have some 'not found' (404) resources
From notes in PR (https://github.com/geosolutions-it/ckanext-gsreport/pull/1#issuecomment-363708643):
A word of caution: reports may take longer time to generate (especially broken links, which will test each resource). To avoid timeouts in UI, reports should be generated from CLI (cron?) with
paster --plugin=ckanext-report report generate [optional name of report] --config=conf.ini command
.
I would like to have a nice Admin UI for report using cards like this
I've added styling with bootstrap 3.x (code was using classes from bs3, although actual css was deployed with different module in datagov's installation), just for this view:
Layout is responsive, so narrow screen will look like this:
@cezio
for point 1. I know that the URL per se is OK (the site landing page) but probably when you are going to check an OWS resource you should take into account the resource type. So, even if the URL is not broken, you should just to notify a warning for this in broken link section to keep the admin informed that the URL is ok but the type does not conresponds
for point 2. I installed your updates bu I cannot see any changes here
for point 3. OK
for point 4. I already run the paster command but there's not broken links notified here
for point 2.I installed your updates bu I cannot see any changes here
Should work now. Slight misconfiguration, status_reports
plugin should be before report
. README is updated.
for point 1. I know that the URL per se is OK (the site landing page) but probably when you are going to check an OWS resource you should take into account the resource type. So, even if the URL is not broken, you should just to notify a warning for this in broken link section to keep the admin informed that the URL is ok but the type does not conresponds
So, what workflow should be used here? If we have resource with one of OWS formats, we should check if it's valid OWS endpoint, and if not, mark it as non-ows and do regular http check (and optionally mark it as invalid http endpoint if that also fails)?
@Cezio yes
For the record, I see some field for improvement in reports:
I've updated ckanext-gsreport, as requested. If OWS check will fail, regular http check is performed. If it success, resource will be marked with message:
@cezio,
there are problems while generating the broken link report in http://opendata-trento.geo-solutions.it/, below the link to the ckan.log file that you can also find in the related VM:
https://demo.geo-solutions.it/share/provinciabz/ckan.log
I think we need to properly manage timeouts, in any case I think you should investigate this in the demo instance.
As far as you can see, in that log we have also the strange logs I mentioned some days ago that need to be solved (see the top of the file), some fragments below:
...
File "/usr/local/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind SocketServer.TCPServer.server_bind(self) File "/usr/local/lib/python2.7/SocketServer.py", line 430, in server_bind self.socket.bind(self.server_address) File "/usr/local/lib/python2.7/socket.py", line 224, in meth return getattr(self._sock,name)(*args) socket.error: [Errno 98] Address already in use
Then there are still problems in the Reports UI, inside the provbz demo here that needs to be investigated.
Below the log reported running the generate command in provbz demo:
(default)[ckan@bolzano src]$ paster --plugin=ckanext-report report generate --config=/etc/ckan/default/production.ini 2018-02-12 12:23:51,804 INFO [ckanext.geonetwork.harvesters.geonetwork] GeoNetwork harvester: extending ISODocument with TimeInstant 2018-02-12 12:23:51,804 INFO [ckanext.geonetwork.harvesters.geonetwork] GeoNetwork harvester: adding old GML URI 2018-02-12 12:23:51,805 INFO [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-begin 2018-02-12 12:23:51,805 INFO [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-begin 2018-02-12 12:23:51,805 INFO [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-end 2018-02-12 12:23:51,805 INFO [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-end 2018-02-12 12:23:51,805 INFO [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-instant 2018-02-12 12:23:51,810 INFO [ckanext.multilang.harvesters.multilang] CSW Multilang harvester: extending ISODocument with PT_FreeText /usr/lib/ckan/default/src/ckan/ckan/new_authz.py:6: FutureWarning: ckan.new_authz has been renamed to ckan.authz. The ckan.new_authz module will be removed in a future release. FutureWarning) 2018-02-12 12:23:51,817 INFO [rdflib] RDFLib Version: 4.2.1 2018-02-12 12:23:52,327 INFO [ckan.lib.cli] Report generation complete {'All Reports': 3.0994415283203125e-06}
there are problems while generating the broken link report in http://opendata-trento.geo-solutions.it/,
If you mean timeout when opening report, it's because broken links will take time to generate (described in readme and in PR notes). I'll add it to cron, so it will be generated once a day. If it's some other problem, please describe it in more detail.
set -h VENV=/usr/local/app_loc/geosolutions/ DATAPUSHER_ROOT=/usr/local/data_loc/src/datapusher/datapusher CONFIG=/usr/local/app_loc/geosolutions//conf/datapusher_settings.py PYTHON_B=/usr/local/app_loc/geosolutions//bin/python source /usr/local/app_loc/geosolutions//bin/activate ++ deactivate nondestructive ++ unset -f pydoc ++ '[' -z '' ']' ++ '[' -z '' ']' ++ '[' -n /bin/bash ']' ... File "/usr/local/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind SocketServer.TCPServer.server_bind(self) File "/usr/local/lib/python2.7/SocketServer.py", line 430, in server_bind self.socket.bind(self.server_address) File "/usr/local/lib/python2.7/socket.py", line 224, in meth return getattr(self._sock,name)(*args) socket.error: [Errno 98] Address already in use
This has nothing to do with reports module. This is probably because datapusher was run manually, and not stopped before running it as supervisord's service.
Then there are still problems in the Reports UI, inside the provbz demo here that needs to be investigated. 2018-02-12 12:23:52,327 INFO [ckan.lib.cli] Report generation complete {'All Reports': 3.0994415283203125e-06}
Again, look at provided readme and notes. There's no status_reports
plugin in ckan.plugins
@cezio the first time I installed the gsreport in http://opendata-trento.geo-solutions.it the report generation was completed quickly with none of the messages I have now in the log (however, there was no broken links in the report that time).
I just need to understand if these logs I have now are normal or not (ok for configuring a cron in both machines)
I know that the log I reported about datapusher is not related to gsreport but should be solved in any case.
About the provbz demo UI report, sorry my fault. However some screenshot below:
However some screenshot below:
Thx. This looks like empty format in Resource. There are two cases actually. One is empty string, other is None (I'm not sure if it's stringified None
, or converted NULL
from db). How should this be handled in reports?
@cezio, In my opinion that should be handled ad not specified
@cezio, about your improvement suggestions:
we can add per-dataset or per-organization filtering
Can we provide a per-organization filtering now? Something like provided here
if report fails during generation, it should leave some trace. Currently all work during generation is discarded, if for some reason report generation fails
That's not reported in the usual ckan's log file?
requests could be cached (urls can be duplicated in datasets, especially when they refer to non-ckan resources, like webgis portals, where one url is used as access gate to multiple resources)
Can you provide an example?
requests could be parallelized to speed-up report generation (with multiprocessing module probably)
Not needed at this time but could be a nice improvement for the future
to avoid multiple instances of report generation running in cron, report generation should use lock
for the moment just one cron with a command that generates all reports in the same time is fine:
paster --plugin=ckanext-report report generate --config=conf.ini
filtering report by error type, resource type
Can this be doable now within the issue's size we have? I need to check the available time on this
@cezio, In my opinion that should be handled ad not specified
Fixed in https://github.com/cezio/ckanext-gsreport/commit/5ac2a43c9c56bed83651ae2ca8670e98e8b400fb
@cezio, about your improvement suggestions:
we can add per-dataset or per-organization filtering Can we provide a per-organization filtering now? Something like provided here
Yes, I can add this. Fix in progress.
if report fails during generation, it should leave some trace. Currently all work during generation is discarded, if for some reason report generation fails That's not reported in the usual ckan's log file?
Yes, but it won't be shown in reports UI. Admin won't know report has failed until logs are checked. This will cause some problems for users as well. User will click on report, which will start to generate (since there's no report saved), and get 500 error instead of meaningful message.
This could be improved by extending report extension, catching error, and save report with message that it failed for specific reason. User would be still free to regenerate it from UI.
requests could be cached (urls can be duplicated in datasets, especially when they refer to non-ckan resources, like webgis portals, where one url is used as access gate to multiple resources) Can you provide an example?
Sure, in tests we had urls to geoportals like http://www.territorio.provincia.tn.it/ which are present in 19 resources, and there are multiple similar cases in db:
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------
url | http://www.provinz.bz.it/natur-raum/themen/landeskartografie.asp
count | 252
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------
url | http://www.provinz.bz.it/natur-raum/
count | 104
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------
url | http://www.provinz.bz.it/strassendienst/
count | 75
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------
url | http://www.provinz.bz.it/forst/
count | 42
So, I suspect it's a real problem, especially when handling non-ckan datasets. Having responses cached during report generation would speed up report generation (we will get 1 connection timeout instead of 50 for example).
to avoid multiple instances of report generation running in cron, report generation should use lock for the moment just one cron with a command that generates all reports in the same time is fine: paster --plugin=ckanext-report report generate --config=conf.ini
Yes, but this doesn't prevent from running multiple instances of script, and this may be a case for larger installations. We use test instances, where we have fairly small number of resources, so report can be generated within ~5-10 minutes for ~1-2k resources. This may be a problem for deployments with 300-400k+ resources active.
filtering report by error type, resource type
Can this be doable now within the issue's size we have? I need to check the available time on this
It should be similar to filtering by organization. I'll need few hours to implement.
@cezio I updated the provbz demo and generated again the all reports. Now in resource format I have not specified = 921. If I click on the not specified link a list of dataset is shown. It seems that something wrong because the first dataset (named Catasto eventi: eventi di frana), for example, has two resources one in CSV format and one in WMS (the same for the other resources whitin listed dataset)
I updated the provbz demo and generated again the all reports. Now in resource format I have not specified = 921. If I click on the not specified link a list of resources is shown. It seems that something wrong because the first dataset (named Catasto eventi: eventi di frana), for example, has two resources one in CSV format and one in WMS (the same for the other resources whitin listed dataset)
This is because url is probably not working correctly for searching empty values. Fix in progress.
@cezio, I understand your points. Please open new issues in gsreport for each point above you reported with an exhaustive description in it and finally mark these as enhancement label. We will check later if we need to work on these issues. For the moment please work only the following: 1 - per-organization filtering 2 - url is probably not working correctly for searching empty values
For the moment please work only the following: 1 - per-organization filtering 2 - url is probably not working correctly for searching empty values
Implemented. See notes in linked issues (geosolutions-it/ckanext-gsreport#4 and geosolutions-it/ckanext-gsreport#3)
@cezio, below some reviews after my checks on your work.
Checks has been performed in provbz demo here:
TRAVIS BUILD
Should be fixed, it seems that the requirements.txt file has a wrong content
RESOURCE FORMAT COUNTS
1 - 'not specified': we have 910 of these (I think these are related to all of resorces named Unnamed resource). However, if I click on the link the dataset list is empty.
2 - 'WMS': we have 267 of these but the count in Facet list is 251. However if i click in the link the dataset list contains 251 dataset that's correct
3 - 'HTML': we have 6 of these but the count in Facet list is 2. However if i click in the link the dataset list contains 2 dataset that's correct
4 - 'CSV': we have 3 of these but the count in Facet list is 2. However if i click in the link the dataset list contains 2 dataset that's correct
5 - 'XLSX': we have 2 of these but the count in Facet list is 1. However if i click in the link the dataset list contains 1 dataset that's correct
6 - 'PDF': we have 2 of these but the count in Facet list is 1. However if i click in the link the dataset list contains 1 dataset that's correct
7 - If I access the report page using the localized path like the following:
http://188.165.62.180/it/report/resources-format
then I have a NOT FOUND error filtering by organizzation:
http://188.165.62.180/it/it/report/resources-format/pab-agricoltura
the same if I click on Refresh button
LICENSES COUNT
1 - 'notspecified': please correct the lable using 'not specified'
2 - If I access the report page using the localized path like the following:
http://188.165.62.180/it/report/licenses
then I have a NOT FOUND error filtering by organizzation:
http://188.165.62.180/it/it/report/licenses/pab-agricoltura
the same if I clink on Refresh button
BROKEN LIKNS
1 - If I access the report page using the localized path like the following:
http://188.165.62.180/it/report/broken-links
then I have a NOT FOUND error filtering by organizzation:
http://188.165.62.180/it/it/report/broken-links/pab-agricoltura
the same if I clink on Refresh button
2 - For some resources I see a correct report like the following :
But for other resources I have the following:
where the OWS exception response is the same:
REPORTS VISUALIZATION
1 - Only the admin user can access the /report endpoint
README
1 - We need to put a documentation to know that we can set a cron job for the report generate command
2 - We need to put a documentation that explain some useful information about the report core functionalities and behaviors, for example:
the report generation time depends on how fast is network and how fast resources are responding
all the records in DB are replaced with the new ones but after succesful generation afair so if report generation fails, old data will be preserved
how an user can be autorized to view a report
the report will be generated if there's no prior data otherwise, user will see calculated result
user can click 'refresh', and then report will be regenerated
@cezio, an additional review below:
The report generation process on http://opendata-trento.geo-solutions.it closes with an exception, below the generated error that should be investigated (probably we should harden some piece of code):
Traceback (most recent call last):
File "/usr/local/app_loc/geosolutions/bin/paster", line 11, in
TRAVIS BUILD
Should be fixed
RESOURCE FORMAT COUNTS 1 - 'not specified': we have 910 of these (I think these are related to all of resorces named Unnamed resource). However, if I click on the link the dataset list is empty.
This was because solr index should be rebuilded after using this plugin (there was a comment in source issue, but I'll add this to readme. reindex will add not-specified
placeholder to solr data, so those items will be searchable). I've run index rebuild on that vm.
2 - 'WMS': we have 267 of these but the count in Facet list is 251. However if i clink in the link the dataset list contains 251 dataset that's correct ... 6 - 'PDF': we have 2 of these but the count in Facet list is 1. However if i clink in the link the dataset list contains 1 dataset that's correct
All above are because report shows number of resources, and facets shows number of datasets found. I can add dataset count column to reports so it will be clear from where this number comes.
7 - If I access the report page using the localized path like the following: http://188.165.62.180/it/report/resources-format then I have a NOT FOUND error filtering by organizzation: http://188.165.62.180/it/it/report/resources-format/pab-agricoltura
Notice bad /it/it/
prefix in path. Fix in progress.
LICENSES COUNT 1 - 'notspecified': please correct the lable using 'not specified'
This is because specific dataset (http://188.165.62.180/dataset/test-dataset) has 'notspecified' license set.
Rest in progress.
7 - If I access the report page using the localized path like the following: http://188.165.62.180/it/report/resources-format then I have a NOT FOUND error filtering by organizzation: http://188.165.62.180/it/it/report/resources-format/pab-agricoltura Notice bad /it/it/ prefix in path. Fix in progress.
Actually this may be harder to fix. It looks like this is a bug in ckan, where 'redirect_to()` add language prefix to stringified url.
Basically, when you select option in form, form is submited with url:
This hits ckanext.report.controllers
: https://github.com/datagovuk/ckanext-report/blob/master/ckanext/report/controllers.py#L43-L46
And returns redirection to /en/report/resources-format/org1
(this is result of toolkit.url_for()
from https://github.com/datagovuk/ckanext-report/blob/master/ckanext/report/helpers.py#L25). And this is where redirect_to()
adds lang code to redirected url.
@cezio, please remove also the double colon in org filter:
@cezio, please remove also the double colon in org filter:
There is no Organization :
string in translation in ckanext-gsreport
. Can this come from i18n from other extension?
This fragment is rendered from https://github.com/cezio/ckanext-gsreport/blob/master/ckanext/gsreport/templates/report/option_organization.html#L9
@cezio, please remove also the double colon in org filter:
Addressed with https://github.com/cezio/ckanext-gsreport/commit/33a52a658cb8a8dd769afde01af06a8db3995575
Only the admin user can access the /report endpoint
Addressed with https://github.com/cezio/ckanext-gsreport/commit/1c2fa79f400ec72fa0ea59c4b30198a4eeb38329. One note here: implementation is a bit hackish, because ckan dissalow auth functions override once they are defined. In this case, our plugin defines auth functions first, report
does it again, so to avoid exception, I've replaced IAuthFunctions.get_auth_functions
from report
plugin, so it will return empty dict.
@cezio, thank you for your fixes. Some notes below:
auth functions do we really need this hack? I would like to avoid it if possible.
from the ckan's doc:
" ... Note that by default, all auth functions provided by extensions are assumed to require a validated user or API key, otherwise a ckan.logic.NotAuthorized: exception will be raised. This check will be performed before calling the actual auth function. If you want to allow anonymous access to one of your actions, its auth function must be decorated with the auth_allow_anonymous_access decorator, available on the plugins toolkit."
redirect to (double lang path in URL: /it/it)
has this behavior been fixed in new ckan's versions ? Can we provide a patch for this in our extension ?
travis build It seems that the related PR from your master has some issues to solve in the travis build.
@cezio, another and most important issue here using your master branch. If I run the following command:
paster --plugin=ckanext-report report generate --config=/usr/local/app_loc/geosolutions/conf/production.ini
the following error occurred both here and here:
Traceback (most recent call last):
File "/usr/local/app_loc/geosolutions/bin/paster", line 11, in
auth functions do we really need this hack? I would like to avoid it if possible.
If we want to allow only admin user, yes.
from the ckan's doc: " ... Note that by default, all auth functions provided by extensions are assumed to require a validated user or API key, otherwise a ckan.logic.NotAuthorized: exception will be raised. This check will be performed before calling the actual auth function. If you want to allow anonymous access to one of your actions, its auth function must be decorated with the auth_allow_anonymous_access decorator, available on the plugins toolkit."
This says only about authenticated users, not admin-level users.
redirect to (double lang path in URL: /it/it) has this behavior been fixed in new ckan's versions ? Can we provide a patch for this in our extension ?
This seems to be fixed in master: https://github.com/ckan/ckan/issues/3499 I'll check if we can backport this somehow to extension (i suspect that means monkey-patching url_for() implementation).
travis build It seems that the related PR from your master has some issues to solve in the travis build.
investigating.
@cezio redirect to (double lang path in URL: /it/it): it seems just partially fixed. go to the following URL and then click on the refresh button:
http://188.165.62.180/it/report/resources-format
auth functions: maybe we can allow also authenticated user. I will do another check before. thank you
travis build: ok, let me know.
redirect to (double lang path in URL: /it/it): it seems just partially fixed. go to the following URL and then click on the refresh button:
Checking.
@cezio, another and most important issue here using your master branch. If I run the following command: the following error occurred both here and here:
should be fixed
redirect to (double lang path in URL: /it/it): it seems just partially fixed. go to the following URL and then click on the refresh button:
This was probably because code for gsreport extension was not updated. i've updated to latest master, seems to work, but please confirm.
@cezio see the following from my tests yesterday:
@cezio what about my comment here? What you did for fixing the problem?
@cezio see the following from my tests yesterday:
Thx. I can reproduce. fix in progress.
@cezio what about my comment here? What you did for fixing the problem?
Fix for this is a part of PR: https://github.com/geosolutions-it/ckanext-gsreport/pull/8/commits/b26161f67ea02f0b6da0329cc490d7ac7adbdbe4
@cezio as we discussed:
1) now is not possible to change the language from the drop down in the footer
2) please check if possible to apply a quick patch to our extension to solve the redirect_to remaining issues , if not possible pleade hide the refresh button.
3) if you hide the refresh button please update also the invoved part on the readme documentation.
4) After the points above I will proceed to a new check to merge the pull requests (I will close #7 and #8 whithout merging them). I think you can direclty include the translations updates in #7 in the PR #9 ok?
if you hide the refresh button please update also the invoved part on the readme documentation.
button removed, readme updated.
Provide a report module for dataset and resources in the catalog. The aim is to know:
We should consider to use the ckanext-report extension and configure the needed reports.
Some real cases linked below: