ckan / ckanext-archiver

Archive CKAN resources
MIT License

Resource URL when CKAN is in subpath #16

Closed AdrianMBarrera closed 8 years ago

AdrianMBarrera commented 8 years ago

Hi,

We've installed CKAN 2.5 and we're trying this extension. Our CKAN site is in a subpath.

When we run the Archiver update command, the resource's path changes (the subpath gets removed). As a result, the Archiver reports this resource as a broken link.

Our configuration:

ckan.site_url = http://ourcatalog.org/data
ckanext-archiver.archive_dir = /var/www/html/cache-datasets
ckanext-archiver.cache_url_root = http://ourcatalog.org/data/resource_cache
ckanext-archiver.max_content_length = 50000000
ckanext-archiver.user_agent_string = "Our Catalog (CKAN)"

Any ideas?

Thank you!

davidread commented 8 years ago

What do you mean by the "resource's path changes"? The resource.url? Some versions of the archiver messed with the resource.url, but the latest shouldn't do that.

AdrianMBarrera commented 8 years ago

Yes, the resource.url. For example:

We are running the latest version of the archiver.

Thank you, @davidread

davidread commented 8 years ago

That's the URL of the web page for the resource, rather than the URL of the data that the resource points to. I don't see how the URL of the web page is related to the archiver, nor why it would report a broken link, since that check uses the URL of the data.

Could you give the real URL of your site?

AdrianMBarrera commented 8 years ago

Definitely!

You can check this URL: http://taro-des.fg.ull.es/datos/dataset/dataset-1/resource/b6bd7843-f3ae-456b-8e93-4f830db99bbe

Thank you for your efforts :D

davidread commented 8 years ago

Adrian, that helps a lot - thanks.

Firstly, I don't know why it is changing the resource.url; that isn't right. Are you definitely running the latest commit of ckan/ckanext-archiver? Reading through the code again, I simply can't see how this could happen (or how it could change the resource or dataset at all). I'd appreciate you debugging it to find out where that occurs. You could insert from ckan import model; print "1", model.Session.dirty statements at the start of all the functions in tasks.py, restart the priority queue and watch its log when you run paster --plugin=ckanext-archiver archiver update dataset-1 --queue=priority -c <path to CKAN config>. When the resource is edited, it should appear in the 'dirty' list.
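For illustration, the debugging suggestion above could also be applied with a small decorator instead of pasting a print statement into each function. This is only a sketch under assumptions: report_dirty, FakeSession and update_resource are hypothetical names, FakeSession stands in for CKAN's model.Session so the example runs without CKAN, and it uses Python 3 print syntax, so it would need adjusting for the Python 2 used by CKAN 2.5.

```python
import functools


def report_dirty(session, label):
    """Return a decorator that prints session.dirty before each call.

    `session` is any object with a `dirty` collection, as SQLAlchemy
    sessions have. In CKAN it would be model.Session (assumption: not
    exercised here).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Show which objects have pending, unflushed modifications.
            print(label, func.__name__, list(session.dirty))
            return func(*args, **kwargs)
        return wrapper
    return decorator


class FakeSession:
    """Hypothetical stand-in for model.Session so the sketch is runnable."""
    def __init__(self):
        self.dirty = set()


session = FakeSession()


@report_dirty(session, "1")
def update_resource(resource_id):
    # Hypothetical placeholder for one of the functions in tasks.py.
    return resource_id
```

Each decorated function then logs the dirty set on entry, so the first call whose log line lists the resource narrows down where it was modified.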

Secondly, there is something odd going on writing the URL. Compare the URL seen in these two similar calls:

You'll see that the latter way is correct, with the /datos/ in there. The difference between the calls is that the former goes through a validation/transformation 'schema', and I think it is also cached by SOLR. So maybe there is something wrong with the schema or with your SOLR cache of the dataset. Have you customized the schema? You could try paster search-index rebuild dataset-1 and see if it helps. Finally, make sure you are using the latest version of CKAN (or the latest patch release), since there was a problem to do with revisions here: https://github.com/ckan/ckan/issues/1779 which is possibly related.

AdrianMBarrera commented 8 years ago

Hi @davidread, sorry for the late response and thank you for your help, we really appreciate it.

Yes, we have installed the latest version of ckan and ckanext-archiver.

After a lot of research, we have found the function that causes our problem.

In tasks.py:

def _update_search_index(package_id, log):
    '''
    Tells CKAN to update its search index for a given package.
    '''
    from ckan import model
    from ckan.lib.search.index import PackageSearchIndex
    package_index = PackageSearchIndex()
    context_ = {'model': model, 'ignore_auth': True, 'session': model.Session,
                'use_cache': False, 'validate': False}
    package = toolkit.get_action('package_show')(context_, {'id': package_id})
    package_index.index_package(package, defer_commit=False)
    log.info('Search indexed %s', package['name'])

By changing this line: context_ = {'model': model, 'ignore_auth': True, 'session': model.Session, 'use_cache': False, 'validate': False}

to: context_ = {'model': model, 'ignore_auth': True, 'session': model.Session, 'validate': False}

our problem is solved. We don't know if this is the correct solution, but... it works.

Now we are dealing with another issue, related to the nginx server.

davidread commented 8 years ago

Great stuff - it's good to shed some light on this by pinpointing where the change is happening. However, I'm afraid this change probably won't help in the long run. At this point in the code it is refreshing the SOLR index of this dataset, but with 'use_cache': False removed it will get the cached dataset and put that in the SOLR index. In other words, the SOLR index entry doesn't actually change, so I think your archiver info will be out of date. And the next time you edit the dataset it will index properly, and I believe you'll have the resource.url corrupted again. You could confirm this by doing a manual reindex: paster search-index rebuild dataset-1. That's basically what those lines should be doing anyway.

It makes me think of a new thing that might well be related to the problem though. There is a sneaky little bit which changes the resource.url if it doesn't start with 'http', which is the case for you I believe - you have a url of areas.txt. This is to ensure that the URL still works when the dataset is harvested to another domain. The code is here: https://github.com/ckan/ckan/blob/cd53881325f87a5c2e92257507bea5a0f638f2bf/ckan/lib/dictization/model_dictize.py#L126-L127 I can't see why this is causing your exact problem, but it is well worth you creating your datasets in the first place with an absolute URL to avoid this contributing to the problem.
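As a rough illustration of the advice above about absolute URLs, the following sketch checks whether a resource URL is absolute and, if not, builds one from a base URL. It is not CKAN's code: is_absolute and make_absolute are hypothetical helpers, and using http://ourcatalog.org/data as the place where areas.txt lives is an assumption made only for the example.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute(url):
    """True if the URL has both a scheme and a host, e.g. http://host/path."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


def make_absolute(url, base):
    """Resolve a relative URL against `base`, keeping the base's subpath.

    A trailing slash is added to `base` first, because urljoin treats the
    last path segment of a base without one as a file name and replaces it.
    """
    if is_absolute(url):
        return url
    if not base.endswith("/"):
        base += "/"
    return urljoin(base, url)


# Hypothetical values: the base is assumed, not taken from the real site.
resource_url = make_absolute("areas.txt", "http://ourcatalog.org/data")
```

Storing the resulting absolute value as the resource.url when the dataset is created means no later step has to guess the scheme, host or subpath.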

AdrianMBarrera commented 8 years ago

Yes, you were right. After reindexing the resource.url got corrupted again.

We've disabled the archiver extension and the error persists. So, we think this probably is an error in our CKAN installation or a CKAN bug (not really related to this extension).

We're still investigating the issue so any help would be greatly appreciated.

davidread commented 8 years ago

Do add the full URL to the resource.url and we can take it from there.

I'll close this ticket since it doesn't sound like an archiver error. Feel free to email me to discuss further.