Archiver 2.0 (to go with QA 2.0)

davidread commented 8 years ago

This takes substantial improvements from the data.gov.uk team. It is intended to be merged with QA 2.0 https://github.com/ckan/ckanext-qa/pull/24

Key changes:

results of the archive are now stored in a dedicated Archival table, rather than in the TaskStatus objects (which were awful to query) and ResourceExtras (which makes it look like your dataset changed every time you archive, and if you're not careful this change triggers another archival and you get an infinite loop)
broken links report is done using the neat ckanext-report infrastructure, generated nightly, rather than generated on the fly in ckanext-qa templates. This avoids the big back-end load when you click on a report. Now the report is available to not just admins but the public too (let's be open about broken links - it helps to add pressure to getting them fixed). It also means that if you don't need the 5 stars of openness checking, you don't need to install ckanext-qa now.
it also talks WMS, so those sorts of servers can be checked if they are working too (extendable for other APIs too)
lots more tests and automatic Travis testing
rather than the data.gov.uk team working on their own fork, they contribute back to the whole. Whilst some teams have been using the root fork of ckanext-archiver, no-one has taken much responsibility for it and it is not well maintained.

To do:

[x] Merge changes from okfn master
[x] Change report HTML from Genshi to Jinja

davidread commented 8 years ago

@philipashlock I understand that you guys are working off forks of okfn/master for ckanext-archiver and ckanext-qa. I see you have a bunch of changes on your forks too:

I think there are similarities in what you've improved with what we've done at data.gov.uk, as seen in thei PR and https://github.com/ckan/ckanext-qa/pull/24 . I note in particular in QA that you have a new table qa_ids which seems a lot like the qa table we've added.

So the question is: are you happy for us to merge these PRs to change/improve ckanext-archiver and ckanext-qa. I think your forks are a fair distance from the main-line, so although your supply of fixes from there will dry up, it is probably low now anyway since they are relatively unmaintained. And undoubtably this change would increase the amount of work you'd have to do to get back. But I think if you ever do get to bring your forks towards the work on mainline then we have a lot of joint benefits. So conceptually are you happy for us to merge these?

davidread commented 8 years ago

@Zharktas @mostrovoi Hi there in Finland! You guys appear to base your substantial work on the data.gov.uk fork of archiver & qa. Does that mean you'd support these two PRs to make them the v2 of the mainline? And can you say a little about your comparison of the existing mainline and the data.gov.uk fork (so it's not just me saying ours is better!)

davidread commented 8 years ago

@drmalex07 Hi there at Public Mundi! Since you've also done some work on archiver and qa, forking from the mainline, could you comment on this PR? Are you happy for archiver/qa to unite with the work at data.gov.uk - obviously it creates work for you if you choose to come to this new version, but hopefully it is a better base and more supported.

davidread commented 8 years ago

@kapucko (microcomp) and @rparrapy & @verena91 (senatics) you've all done some work on archiver & qa, forking from mainline. Would you also be happy for archiver/qa to unite with the work at data.gov.uk - obviously it creates work for you if you choose to come to this new version, but hopefully it is a better base and more supported.

Zharktas commented 8 years ago

@davidread I fully support merging these. We moved to datagovuk's forks because mainline ckanext-qa was quite simplistic and datagovuk's fork was far superior in identifying files and reporting them. We will most likely move to the mainline again if these get merged at some point.

davidread commented 8 years ago

Many thanks for the feedback @Zharktas. Still hoping to hear from @philipashlock, @drmalex07, @kapucko and @rparrapy or @verena91. If we can aim to respond by the end of this week then we can decide then.

drmalex07 commented 8 years ago

@davidread Hello David. I have not neglected this, we will respond within the next 2 days! Thanks for your patience!

davidread commented 8 years ago

@drmalex07 later this week is fine - I'd much rather people considered the code and the principles of it fully, to give this project the best possible future. Thanks!

mostrovoi commented 8 years ago

@davidread Merging your fork to mainline again shouldn't be very difficult I believe. When we started using your fork there was just a couple of your own external references that we ended up integrating in our own customized fork. We were also wondering if would be possible to integrate qa and report in just one module since qa can't work without archiver and archiver by itself only provides a cache of the resources. I believe setup would be easier as well as maintenance right?

davidread commented 8 years ago

@mostrovoi thanks for your comments and I'm pleased to hear you're happy that the proposed merge isn't too drastic.

I've just looked at the changes on your branch https://github.com/yhteentoimivuuspalvelut/ckanext-archiver and have pulled across what we were missing - so THANKS for all of these: the helper name fix, CKAN 2.3 compatibility with ResourceGroup. I'm hesitant of moving the requirements from setup.py into pip-requirements.txt as it makes it another step to install, but if people want that then no problem. And we'll leave the Celery 3.0 stuff for another day as I just need to coordinate testing that upgrade with a couple of other uses my team have of it (inventory & packagezip) and I'll have to look to see if others are using it too.

Regarding integrating Archiver and QA, I see the archived resources as an enabler for lots of utilities - QA's 5 stars of openness is just one of many. For instance at data.gov.uk we have ckanext-packagezip which zips up all of a dataset's archived resources. We also looking at reporting on the files' text encoding (UTF8 or otherwise). It's worth noting that the broken links report is moving from QA to Archiver in this PR, and that functionality is a lot more popular than the 5 stars one, so I think there will be plenty of sites happy to just have that and not install QA. Also, the mainline QA update was triggered by a resource being created/updated, rather than the archiver doing its work, so it was only with a bit of luck that the QA occurred after an archival, and so in that case it would be better to run them in one task. However with this PR, the QA is triggered by an archival, using the IPipe interface (i.e. a way for an extension to subscribe to archival events). So I think having them separate makes sense, since they have loose coupling now, and each is quite cohesive.

rparrapy commented 8 years ago

@davidread I am no longer formally related to SENATICs, but I do think this merge makes sense, particularly if it leads to better maintenance of the mainstream project. To get a more official opinion on the matter let me invite @juanpane and @OpenDataGovPY to the discussion.

davidread commented 8 years ago

@rparrapy Thanks for that - all opinions welcome on this :+1:

drmalex07 commented 8 years ago

Hello @davidread.

I think that changes related to the storage of computed/discovered metadata of archived resources are to the right direction. We also had a problem, because archiving of a resource had the side effect of re-updating the resource (thus triggering a new DomainObject notification, which by some acrobatics you had to detect and ignore). I suppose, the only (small) downside is that you miss some handy metadata attributes as size (computed at https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/tasks.py#L176).

I don't really have to comment something on the QA part, as for now we are not using it.

As a general note (from a quick look to the datagovuk branch) i have to say that the actual code seems cleaner and comprehensible.

davidread commented 8 years ago

Thanks for all the responses. No-one's come out against this merge, 4 parties have stated their support here, plus 1 at the related one at https://github.com/ckan/ckanext-qa/pull/24 plus the support of @amercader and the CKAN tech team, so I'm going to do the merge this week and I look forward to working with you all on this better maintained main-line of Archiver & QA!

ckan / ckanext-archiver

Archiver 2.0 (to go with QA 2.0) #15