OCHA-DAP / hdx-ckan

A repo for HDX's configurations and extensions to CKAN

Possible cache management issue arises from using the perma_link as the url for resource downloads #1922

Closed takavarasha closed 9 years ago

takavarasha commented 9 years ago

It can take a very long time (several hours in a case we observed yesterday) between a resource's file being updated and users being able to download the updated file from the download link. During this period, users who click the download resource button are served the old file. The updated file is available on CKAN. This can be verified either by using the CKAN API to get the URL of the file for the resource or, for files that can be previewed, by clicking the preview button and using the URL provided on the preview page. This issue was experienced on the following dataset: https://data.hdx.rwlabs.org/dataset/bed-capacity

danmihaila commented 9 years ago

@alexandru-m-g any feedback on this?

alexandru-m-g commented 9 years ago

I think we should discuss this with @teodorescuserban. Is it an nginx cache that causes this? I know I've tried this a few times (not sure if on production, but definitely on staging) and I got the new version each time.

alexandru-m-g commented 9 years ago

@takavarasha @danmihaila @cjhendrix I've tried to reproduce this today both on staging and on production. I did the following:

  1. Uploaded file A in a dataset, then downloaded it.
  2. Updated file A, then downloaded it as the same user. I was able to see the updated file.
  3. Updated file A as a different user, then tried to download the updated file as that user. I was able to see the updated file.
  4. Updated file A as a different user, then tried to download the updated file as the initial user. I was able to see the updated file.
  5. Tried updating exactly the dataset mentioned in the ticket (but on staging): https://test-data.hdx.rwlabs.org/dataset/bed-capacity. I downloaded it, updated it and then downloaded it again. I could see the change.
  6. Did the same as above with the downloading user logged out and the updating user logged in.

I also checked that Pragma and Cache-Control are set to no-cache in the HTTP headers when downloading the file.

I need some help in reproducing this.

cjhendrix commented 9 years ago

@takavarasha Could you try to reproduce this, please? I know when I am testing, it's easy to get a bit mixed up and think one thing has happened when in fact it was something else. I'm hoping that's the case with this one.

cjhendrix commented 9 years ago

Pinging @takavarasha for an update when you have time. We couldn't replicate this one.

takavarasha commented 9 years ago

I will try replicating this from our end. The file could have been cached elsewhere.

takavarasha commented 9 years ago

I managed to replicate the issue from within the UN network today with @luiscape. We will try replicating the issue from outside the UN network and report back with our findings.

luiscape commented 9 years ago

We found the issue on the "Number of Ebola cases in Guinea, Liberia, Sierra Leone, Nigeria, Mali, Spain and USA" dataset. When a user clicks on the download button directly from the dataset page, the user gets a file from December 31. When a user clicks on "Preview" and then clicks on the URL, she gets a file from January 5 (the latest file).

This is happening with the resource ebola-cases-jan-05-2015-who-gar.xls.

The issue can be understood using the API as well. By querying the following endpoint: https://data.hdx.rwlabs.org/api/action/dataset_show?id=ebola-cases-2014

You get:

  {
    resource_group_id: "30e2bf85-e312-4215-9ea8-bb4047664546",
    resource_uploader: "luiscape",
    cache_last_updated: null,
    revision_timestamp: "2015-01-05T17:28:28.610002",
    webstore_last_updated: null,
    datastore_active: false,
    id: "76defd41-cca7-4dda-8363-2d2d51d6e877",
    size: null,
    state: "active",
    mimetype: null,
    hash: "",
    description: "Extracted from latest WHO Ebola Response Roadmap Situation Report",
    format: "XLS",
    tracking_summary: {
      total: 0,
      recent: 0
    },
    last_modified: null,
    url_type: null,
    perma_link: "http://data.hdx.rwlabs.org/dataset/ebola-cases-2014/resource_download/76defd41-cca7-4dda-8363-2d2d51d6e877",
    cache_url: null,
    name: "ebola-cases-jan-05-2015-who-gar.xls",
    created: "2014-09-08T20:10:21.665010",
    url: "http://data.hdx.rwlabs.org/storage/f/2015-01-05T17%3A28%3A21.402Z/ebola-cases-jan-05-2015-who-gar.xls",
    webstore_url: null,
    mimetype_inner: null,
    position: 0,
    revision_id: "587abf3b-8641-4744-9e47-f8fb12e26d52",
    resource_type: "file.upload"
  },

If you click on the perma_link URL you get a file from December 31. If you click on the url URL you get a file from January 5 (the latest file).
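For anyone who wants to check this without clicking through the UI, here is a rough Python sketch that hashes what perma_link and url each serve for every resource in the API response. It assumes the response uses the usual CKAN action API envelope (`result` containing a `resources` list). The function names and the injectable `fetch` argument are made up for illustration and are not part of CKAN or HDX.

```python
import hashlib
import json
from urllib.request import urlopen


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def compare_resource_links(resource, fetch=fetch_bytes):
    """Hash what perma_link and url each serve for one resource dict."""
    digests = {}
    for key in ("perma_link", "url"):
        link = resource.get(key)
        if link:
            digests[key] = hashlib.sha256(fetch(link)).hexdigest()
    # With fewer than two links there is nothing to compare.
    in_sync = len(set(digests.values())) <= 1
    return {"name": resource.get("name"), "in_sync": in_sync, "digests": digests}


def check_dataset(api_url, fetch=fetch_bytes):
    """Run the comparison for every resource returned by the API call.

    Assumption: the payload looks like {"result": {"resources": [...]}}.
    """
    payload = json.loads(fetch(api_url).decode("utf-8"))
    resources = payload["result"]["resources"]
    return [compare_resource_links(r, fetch) for r in resources]
```

Calling `check_dataset` with the endpoint above from inside and outside the UN network should show whether the two links diverge only behind a particular proxy. A resource with `in_sync` set to False is one where the two links returned different bytes.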

teodorescuserban commented 9 years ago

As far as I can see, the Download button points to the new permalink system, while the Preview button doesn't. Maybe the template for the dataset page needs to be modified? I am pretty sure there are no caching issues here. It's just that the link from Preview points to a different URL than the one on Download.

teodorescuserban commented 9 years ago

Yep. I was right. Look into ckanext-hdx_theme/ckanext/hdx_theme/templates/package/snippets/resource_item.html:

  {% block resource_item_explore %}
    {% if not url_is_edit %}
    {# Adding classes ga-download, ga-preview, and ga-share for easy Google Analytics tracking. PLEASE DO NOT REMOVE #}
    <div class="hdx-btn-group">
      {% block resource_item_explore_links %}
      {% if res.can_be_previewed %}
      <a href="{{ url }}" class="btn btn-secondary hdx-btn ga-preview">
        {{ _('Preview') }}
      </a>
      {% endif %}

      {% set resource_dwd_url = res.perma_link if res.perma_link else res.url %}
      <a href="{{ resource_dwd_url }}" class="btn btn-secondary hdx-btn resource-url-analytics ga-download" title="{{ _('Download') }}" target="_blank">
        <img src="/images/homepage-new/download.svg" alt=" {{ _('Download') }}" />
      </a>

      {% set button_id = 'social-btn-' + res.id %}
      {% set social_div_id = 'social-' + res.id %}
      {% set social_wrapper_div_id = 'social-wrapper-' + res.id %}

The first a href is Preview and links to url, while the second a href is Download and links to resource_dwd_url.

@cjhendrix : assign it to me if @alexandru-m-g has too much already.

cjhendrix commented 9 years ago

Thanks @takavarasha @luis @teodorescuserban. We know what we need to do and it's in sprint 46. However, maybe this needs to be a hotfix? Data team's call.

alexandru-m-g commented 9 years ago

@cjhendrix @luiscape @teodorescuserban

I'm not sure that it's clear what's causing the problem. There are indeed 2 URLs but those URLs should point to the same file and to the same version of the file.

More to the point, why would clicking on the download button return an older version of the file? I tried it now and I get the new file. My only guess would be that some proxy server is storing an older version for the permalink.

alexandru-m-g commented 9 years ago

I get this in the HTTP headers, which is weird. There should be no hit from rp-C1.

  HTTP/1.1 200 OK
  Server: nginx
  Date: Tue, 06 Jan 2015 17:55:04 GMT
  Content-Type: application/vnd.ms-excel
  Content-Length: 173568
  Cache-Control: no-cache
  Content-Disposition: inline; filename="test.xls"
  Pragma: no-cache
  Accept-Ranges: bytes
  Last-Modified: Tue, 06 Jan 2015 17:47:36 GMT
  Etag: "1420566456.4-173568"
  Content-Range: bytes 0-173567/173568
  X-Nginx-Cache: MISS
  Age: 59
  X-Cache: HIT from rp-C1
  X-Cache-Lookup: HIT from rp-C1:80
  Connection: keep-alive
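To make this kind of check repeatable, a small Python sketch like the one below could flag responses that look like they were answered by an intermediary cache. It assumes the proxy reports hits in the non-standard X-Cache / X-Cache-Lookup / X-Nginx-Cache headers with a value starting with "HIT", as in the dump above, and treats a non-zero Age header as a second hint. The function name is hypothetical.

```python
def find_intermediary_cache_hits(headers):
    """Return reasons to suspect a proxy answered from its own cache."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    reasons = []
    # Assumption: these non-standard headers start with "HIT" on a cache hit.
    for name in ("x-cache", "x-cache-lookup", "x-nginx-cache"):
        value = normalized.get(name, "")
        if value.upper().startswith("HIT"):
            reasons.append(f"{name}: {value}")
    # A positive Age means the response spent time in some cache.
    age = normalized.get("age", "0")
    if age.isdigit() and int(age) > 0:
        reasons.append(f"age: {age}")
    return reasons


# Values copied from the response headers quoted in this comment.
observed = {
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "X-Nginx-Cache": "MISS",
    "Age": "59",
    "X-Cache": "HIT from rp-C1",
    "X-Cache-Lookup": "HIT from rp-C1:80",
}
print(find_intermediary_cache_hits(observed))
```

An empty list means none of these hints were present. It does not prove that no cache was involved, since a proxy is free to omit these headers.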

alexandru-m-g commented 9 years ago

@teodorescuserban I'm thinking of replacing Cache-Control: no-cache with Cache-Control: max-age=0, no-store, no-cache

What do you think ?
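As a rough illustration of the idea only: the thread does not say whether the header is set in CKAN or in nginx, so the Python WSGI middleware below is a hypothetical sketch, not the actual change. The class name and the path markers (taken from the two download URL shapes seen in this ticket) are assumptions.

```python
class NoStoreDownloads:
    """WSGI middleware that replaces Cache-Control on download responses."""

    def __init__(self, app,
                 path_markers=("/resource_download/", "/storage/f/"),
                 value="max-age=0, no-store, no-cache"):
        self.app = app
        # Assumption: download URLs contain one of these path fragments.
        self.path_markers = path_markers
        self.value = value

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if not any(marker in path for marker in self.path_markers):
            # Not a download request: pass through untouched.
            return self.app(environ, start_response)

        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing Cache-Control header, then add the stricter one.
            kept = [(k, v) for k, v in headers if k.lower() != "cache-control"]
            kept.append(("Cache-Control", self.value))
            return start_response(status, kept, exc_info)

        return self.app(environ, patched_start_response)
```

Wrapping the application object, for example `app = NoStoreDownloads(app)`, would apply the header to matching paths only. Adding no-store asks caches not to keep a copy at all, whereas no-cache alone still allows a cache to store the response as long as it revalidates before reuse.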

alexandru-m-g commented 9 years ago

Once this change gets to staging we can see if it helps.

luiscape commented 9 years ago

@alexandru-m-g just to let you know that the error happened once again in the UN network, but that I didn't grab the log (as you asked). There is another update pending today, I'll try to grab the log then.

alexandru-m-g commented 9 years ago

@luiscape @teodorescuserban @cjhendrix I checked this on staging after the latest changes and things seem to be looking much better. On all tests I get this:

  HTTP/1.1 200 OK
  Server: nginx
  Date: Mon, 12 Jan 2015 18:38:59 GMT
  Content-Type: application/vnd.ms-excel
  Content-Length: 140800
  Cache-Control: max-age=0, no-store, no-cache
  Content-Disposition: inline; filename="test2.xls"
  Pragma: no-cache
  Accept-Ranges: bytes
  Last-Modified: Mon, 12 Jan 2015 18:37:28 GMT
  Etag: "1421087848.58-140800"
  Content-Range: bytes 0-140799/140800
  X-Nginx-Cache: MISS
  X-Cache: MISS from rp-C1
  X-Cache-Lookup: MISS from rp-C1:80
  Connection: keep-alive

alexandru-m-g commented 9 years ago

The HTTP response headers that I see on staging look much better now. Let's see if this fixes the problem on prod.