ckan / ckanext-archiver

Archive CKAN resources
MIT License
21 stars 46 forks source link

Don't archive uploaded resources #36

Closed KrzysztofMadejski closed 7 years ago

KrzysztofMadejski commented 7 years ago

My latest archiver: https://github.com/ckan/ckanext-archiver/commit/5668c5377351e93babe0fa2ea3d38cfd4d693f38 + #33 PR archives resources that were uploaded to server. It's a waste of space as these are files are already stored on the server.

Proposed solution:

    [2017-03-15 11:07:56,236: INFO/MainProcess] Got task from broker: archiver.update_package[localhost/ecaa]
[2017-03-15 11:07:56,590: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Starting update_package task: package_id=u'514dcbb8-fabe-40cb-927a-3e408c431dfa' queue=priority
[2017-03-15 11:07:56,749: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Attempting to download resource: http://localhost/dataset/514dcbb8-fabe-40cb-927a-3e408c431dfa/resource/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/download/b08a4d7b01a04852b914e7904a73b1b8.png
[2017-03-15 11:07:56,756: INFO/PoolWorker-1] Starting new HTTP connection (1): localhost
[2017-03-15 11:07:58,842: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: GET started successfully. Content headers: {'transfer-encoding': 'chunked', 'accept-ranges': 'bytes', 'server': 'Apache/2.4.7 (Ubuntu)', 'last-modified': 'Wed, 15 Mar 2017 10:07:56 GMT', 'content-range': 'bytes 0-22897/22898', 'etag': '"1489572476.31-22898"', 'pragma': 'no-cache', 'cache-control': 'no-cache', 'date': 'Wed, 15 Mar 2017 10:07:56 GMT', 'content-type': 'image/png'}
[2017-03-15 11:07:58,842: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Downloading the body
[2017-03-15 11:07:58,842: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Saving resource
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Resource saved. Length: 22898 File: /tmp/tmpz8ExfA
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Resource downloaded: id=6be642a7-0e53-498c-b8ef-a8ca5e5062e6 url='http://localhost/dataset/514dcbb8-fabe-40cb-927a-3e408c431dfa/resource/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/download/b08a4d7b01a04852b914e7904a73b1b8.png' cache_filename=/tmp/tmpz8ExfA length=22898 hash=df011021f5d8135f55b8377160ba5503dbd6c316
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Attempting to archive resource
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Going to do chmod: /home/ckan/data/archiver/6b/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/b08a4d7b01a04852b914e7904a73b1b8.png
[2017-03-15 11:07:58,844: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Archived resource as: /home/ckan/data/archiver/6b/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/b08a4d7b01a04852b914e7904a73b1b8.png
[2017-03-15 11:07:58,850: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Archival saved: <Archival Downloaded OK /dataset/localhost/resource/6be642a7-0e53-498c-b8ef-a8ca5e5062e6 >
[2017-03-15 11:07:58,857: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Notifying package as 1 items were archived
TkTech commented 7 years ago

Instead of checking site_url, why not check if the resource has type='upload'? Otherwise you'll archive resources that are "uploaded" but aren't living on the server like those uploaded by ckanext-cloudstorage.

KrzysztofMadejski commented 7 years ago

@amercader @davidread Any suggestions where this should be coded?

I guess either in a) https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/tasks.py#L232 ("Won't attemp to download") or in b) download function https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/tasks.py#L338 (checking type=upload as @TkTech suggested) (throwing ChooseNotToDownload?)

I'd go for a) and prefilling proper download function response using existing file:

    return {'mimetype': mimetype,
            'size': length,
            'hash': hash,
            'headers': dict(res.headers),
            'saved_file': saved_file_path,
            'url_redirected_to': url_redirected_to,
            'request_type': method}
davidread commented 7 years ago

Regarding @TkTech 's suggestion, if the file is in the cloud then ckanext-qa can't run a hash on it, or libmagic etc - you need a local copy for that, so I'd suggest sticking with your original plan, or adding a config option.

I'd start down the path of a) and see if it works out.

TkTech commented 7 years ago

@davidread what does ckanext-qa need? ckanext-cloudstorage can populate the mimetype and hash fields which are core fields on a resource as it's uploaded.

Having large resources (say >=4GB) uploaded to your s3/azure/whatever bucket, then downloaded locally again kind of defeats the point. You need to be able to disable this for things you're already hosting.

davidread commented 7 years ago

@TkTech clearly you can optimize for whichever situation - I'm not saying you're wrong. But ckanext-archiver is designed for providing ckanext-qa and ckanext-packagezip and the like with local file access. If you want a solution optimized for simply caching files except the ones like these, then I suggest you fork it.

KrzysztofMadejski commented 7 years ago

I've also understood that the main goal of archiver was caching local externally hosted files, which availability "we as dataportal provider" cannot assure. Coding qa as dependent on archiver is a wise solution, but I see that also other providers could offer qa the info it needs, though it's not trivial. Some feaures (like hash and libmagic) would need to be run on the storage server.

@TkTech I will start another issue for discussion on how ckanext-cloudstorage and ckanext-qa could work together.

TkTech commented 7 years ago

That was my understanding as well. I want ckanext-archiver and ckanext-qa to run only on files our portal does not already have complete control over (type='upload'), those being externally hosted. This seems like a common enough case and simple enough to do just as an ini setting unless I'm missing something?

davidread commented 7 years ago

A simple config setting for this is fine.

QA does various things too. It would make little sense to exclude some files from the 5 stars of openness report, for example. But if the bits of Archiver and QA you do want to use don't make sense to you for some files then having a simple config option for that is fine.