Closed takavarasha closed 9 years ago
@alexandru-m-g any feedback on this?
I think we should discuss this with @teodorescuserban. Is it an nginx cache that causes this? I know I've tried this a few times (not sure if on production, but surely on staging) and I got the new version each time.
@takavarasha @danmihaila @cjhendrix I've tried to reproduce this today both on staging and on production. I did the following:
Also checked, Pragma and Cache-Control are set to no-cache in the HTTP headers when downloading the file
I need some help in reproducing this.
@takavarasha Could you try to reproduce this, please? I know when I am testing, it's easy to get a bit mixed up and think one thing has happened when in fact it was something else. I'm hoping that's the case with this one.
Pinging @takavarasha for an update when you have time. We couldn't replicate this one.
I will try replicating this from our end. The file could have been cached elsewhere.
I managed to replicate the issue from within the UN network today with @luiscape. We will try replicating the issue from outside the UN network and report back with our findings.
We found the issue on the Number of Ebola cases in Guinea, Liberia, Sierra Leone, Nigeria, Mali, Spain and USA dataset. When a user clicks on the download button directly from the dataset page, the user gets a file from December 31. When a user clicks on "Preview" and then clicks on the URL, she gets a file from January 5 (the latest file).
This is happening with the resource ebola-cases-jan-05-2015-who-gar.xls.
The issue can be understood using the API as well. By querying the following endpoint: https://data.hdx.rwlabs.org/api/action/dataset_show?id=ebola-cases-2014
You get:
{
    resource_group_id: "30e2bf85-e312-4215-9ea8-bb4047664546",
    resource_uploader: "luiscape",
    cache_last_updated: null,
    revision_timestamp: "2015-01-05T17:28:28.610002",
    webstore_last_updated: null,
    datastore_active: false,
    id: "76defd41-cca7-4dda-8363-2d2d51d6e877",
    size: null,
    state: "active",
    mimetype: null,
    hash: "",
    description: "Extracted from latest WHO Ebola Response Roadmap Situation Report",
    format: "XLS",
    tracking_summary: {
        total: 0,
        recent: 0
    },
    last_modified: null,
    url_type: null,
    perma_link: "http://data.hdx.rwlabs.org/dataset/ebola-cases-2014/resource_download/76defd41-cca7-4dda-8363-2d2d51d6e877",
    cache_url: null,
    name: "ebola-cases-jan-05-2015-who-gar.xls",
    created: "2014-09-08T20:10:21.665010",
    url: "http://data.hdx.rwlabs.org/storage/f/2015-01-05T17%3A28%3A21.402Z/ebola-cases-jan-05-2015-who-gar.xls",
    webstore_url: null,
    mimetype_inner: null,
    position: 0,
    revision_id: "587abf3b-8641-4744-9e47-f8fb12e26d52",
    resource_type: "file.upload"
},
If you click on the perma_link URL, you get a file from December 31. If you click on the url URL, you get a file from January 5 (the latest file).
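To make that comparison repeatable, here is a minimal Python sketch, not part of the HDX codebase. It assumes the usual CKAN action API envelope, where resources sit under result.resources, and the perma_link and url fields shown above. The function names and the injectable fetch parameter are hypothetical, chosen for illustration.

```python
import hashlib
import urllib.request
from typing import Callable, Dict, List, Optional


def resource_links(dataset_show_response: dict) -> List[Dict[str, Optional[str]]]:
    """Collect name, perma_link and url for every resource in a dataset_show reply."""
    resources = dataset_show_response.get("result", {}).get("resources", [])
    return [
        {"name": r.get("name"), "perma_link": r.get("perma_link"), "url": r.get("url")}
        for r in resources
    ]


def http_fetch(url: str) -> bytes:
    """Download a URL and return the raw body (hypothetical default fetcher)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def links_disagree(link: dict, fetch: Callable[[str], bytes] = http_fetch) -> bool:
    """Return True when perma_link and url both exist but serve different bytes."""
    if not link.get("perma_link") or not link.get("url"):
        return False  # nothing to compare
    digests = {
        hashlib.sha256(fetch(link[key])).hexdigest() for key in ("perma_link", "url")
    }
    return len(digests) > 1
```

Running links_disagree over each entry returned by resource_links, from inside and outside the UN network, would show whether the stale file depends on the network path.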
As I see, the Download points to the new permalink system, while the preview doesn't. Maybe the template for the dataset page needs to be modified? I am pretty sure there are no caching issues here. It's just that the link from preview points to another direction than the one on the download.
Yep. I was right. Look into ckanext-hdx_theme/ckanext/hdx_theme/templates/package/snippets/resource_item.html:
{% block resource_item_explore %}
  {% if not url_is_edit %}
    {# Adding classes ga-download, ga-preview, and ga-share for easy Google Analytics tracking. PLEASE DO NOT REMOVE #}
    <div class="hdx-btn-group">
      {% block resource_item_explore_links %}
        {% if res.can_be_previewed %}
          <a href="{{ url }}" class="btn btn-secondary hdx-btn ga-preview">
            {{ _('Preview') }}
          </a>
        {% endif %}
        {% set resource_dwd_url = res.perma_link if res.perma_link else res.url %}
        <a href="{{ resource_dwd_url }}" class="btn btn-secondary hdx-btn resource-url-analytics ga-download" title="{{ _('Download') }}" target="_blank">
          <img src="/images/homepage-new/download.svg" alt=" {{ _('Download') }}" />
        </a>
        {% set button_id = 'social-btn-' + res.id %}
        {% set social_div_id = 'social-' + res.id %}
        {% set social_wrapper_div_id = 'social-wrapper-' + res.id %}
The first a href is the preview button and links to url, while the second a href is the download button and links to resource_dwd_url.
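For anyone who wants to check the link selection outside the template, the logic can be mirrored in Python. This is only an illustrative sketch: button_targets is a hypothetical helper, and preview_url stands in for whatever value the template receives as url, which this snippet does not show.

```python
from typing import Dict, Optional


def button_targets(res: dict, preview_url: str) -> Dict[str, Optional[str]]:
    """Mirror the template: Download prefers perma_link and falls back to url."""
    # Same rule as: res.perma_link if res.perma_link else res.url
    download = res.get("perma_link") or res.get("url")
    return {"preview": preview_url, "download": download}
```

With the resource from the API output above, the download target is the perma_link, so the two buttons end up on different URLs whenever a perma_link exists.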
@cjhendrix : assign it to me if @alexandru-m-g has too much already.
Thanks @takavarasha @luis @teodorescuserban. We know what we need to do and it's in sprint 46. However, maybe this needs to be a hotfix? Data team's call.
@cjhendrix @luiscape @teodorescuserban
I'm not sure that it's clear what's causing the problem. There are indeed 2 URLs but those URLs should point to the same file and to the same version of the file.
More to the point, why would clicking the download button return an older file? I tried it just now and got the new file. My only guess is that some proxy server is storing an older version for the permalink.
I get this in the HTTP headers, which is weird. There should be no hit from rp-C1.
(Status-Line) HTTP/1.1 200 OK
Server: nginx
Date: Tue, 06 Jan 2015 17:55:04 GMT
Content-Type: application/vnd.ms-excel
Content-Length: 173568
Cache-Control: no-cache
Content-Disposition: inline; filename="test.xls"
Pragma: no-cache
Accept-Ranges: bytes
Last-Modified: Tue, 06 Jan 2015 17:47:36 GMT
Etag: "1420566456.4-173568"
Content-Range: bytes 0-173567/173568
X-Nginx-Cache: MISS
Age: 59
X-Cache: HIT from rp-C1
X-Cache-Lookup: HIT from rp-C1:80
Connection: keep-alive
@teodorescuserban I'm thinking of replacing Cache-Control: no-cache with Cache-Control: "max-age=0, no-store, no-cache".
What do you think ?
Once this change gets to staging we can see if it helps.
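A small Python sketch of how such header dumps could be checked automatically. It treats X-Cache and X-Cache-Lookup as the non-standard hit/miss markers added by the rp-C1 proxy, which is an assumption based on the output above, and both function names are hypothetical. Note that no-cache on its own still lets a cache store the response as long as it revalidates before reuse, whereas no-store forbids storing it at all.

```python
from typing import Dict, Set


def cache_directives(headers: Dict[str, str]) -> Set[str]:
    """Split the Cache-Control value into a set of lowercase directives."""
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("cache-control", "")
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def served_by_intermediate_cache(headers: Dict[str, str]) -> bool:
    """Return True when a proxy reports a HIT in X-Cache or X-Cache-Lookup."""
    normalized = {name.lower(): value for name, value in headers.items()}
    for name in ("x-cache", "x-cache-lookup"):
        # Assumed format: "HIT from <proxy>" or "MISS from <proxy>"
        if normalized.get(name, "").strip().upper().startswith("HIT"):
            return True
    return False
```

For the response quoted above, served_by_intermediate_cache returns True and "no-store" is absent from cache_directives, which matches the suspicion that a proxy kept a copy.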
@alexandru-m-g just to let you know that the error happened once again in the UN network, but that I didn't grab the log (as you asked). There is another update pending today, I'll try to grab the log then.
@luiscape @teodorescuserban @cjhendrix I checked this on staging after the latest changes and things seem to be looking much better. On all tests I get this:
(Status-Line) HTTP/1.1 200 OK
Server: nginx
Date: Mon, 12 Jan 2015 18:38:59 GMT
Content-Type: application/vnd.ms-excel
Content-Length: 140800
Cache-Control: max-age=0, no-store, no-cache
Content-Disposition: inline; filename="test2.xls"
Pragma: no-cache
Accept-Ranges: bytes
Last-Modified: Mon, 12 Jan 2015 18:37:28 GMT
Etag: "1421087848.58-140800"
Content-Range: bytes 0-140799/140800
X-Nginx-Cache: MISS
X-Cache: MISS from rp-C1
X-Cache-Lookup: MISS from rp-C1:80
Connection: keep-alive
The HTTP response headers that I see on staging look much better now. Let's see if this fixes the problem on prod.
It may take a very long time (several hours in a case we observed yesterday) from the time the file for a resource is updated to the time that users who click on the download link can download the updated file. During this period, users are served the old file when they click on the download resource button. The updated file is available on CKAN, and this can be verified either by using the CKAN API to get the URL of the file for the resource or, for files that can be previewed, by clicking on the preview button and using the URL provided on the preview page. This issue was experienced on the following dataset: https://data.hdx.rwlabs.org/dataset/bed-capacity