codeforIATI / iati-data-bugtracker

🐛 A public log of issues with IATI data and metadata
https://bugtracker.codeforiati.org

[BUG] Australia DFAT (AU-5) blocks requests, with a 403 status code response #9

Open andylolz opened 3 years ago

andylolz commented 3 years ago

Explanation of the bug

Since around July 2020, Australia DFAT (AU-5) servers hosting IATI data appear to have been responding with a 403 status code to various services that consume IATI data. I suspect they are blacklisting these services by IP address, but that’s unclear.

When I request the data from my machine, it works fine, e.g. this dataset: https://iatiregistry.org/dataset/ausgov-af

However, we can see d-portal has had trouble downloading the same dataset: http://d-portal.org/ctrack.html#view=dash_sluglog&slug=ausgov-af [Screenshot 2021-01-19 at 15:50:51]

In fact, d-portal hasn’t successfully downloaded any Australia DFAT data since their recent server reboot: http://d-portal.org/ctrack.html#view=search&reporting_ref=AU-5 [Screenshot 2021-01-19 at 15:50:25]

The new datastore also seems to have had problems with it: https://iatidatastore.iatistandard.org/search/activity?q=reporting_org_ref:(AU-5)&wt=json&rows=50

{
  …
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}
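
A minimal sketch of how this check could be scripted, assuming the response keeps the Solr-style shape shown above (a top-level `response` object with a `numFound` field). The function name `num_found` is hypothetical, and the sample string is a trimmed stand-in for a real response body, not a live request.

```python
import json


def num_found(response_text: str) -> int:
    """Return response.numFound from a Solr-style JSON body (0 if absent)."""
    body = json.loads(response_text)
    return int(body.get("response", {}).get("numFound", 0))


# Trimmed stand-in for the body returned by the search URL above.
sample = '{"response": {"numFound": 0, "start": 0, "docs": []}}'

if num_found(sample) == 0:
    print("No AU-5 activities indexed")
```

A count of zero only says that nothing is indexed; it does not by itself show why the fetch failed.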

…And IATI data dump is showing errors for all Australia DFAT data: https://gist.githubusercontent.com/codeforIATIbot/f117c9be138aa94c9762d57affc51a64/raw/e9e26621d812b89789c6bbc8697fb4461bbb974e/errors


According to the old datastore, Australia DFAT data was last successfully fetched in July 2020: http://datastore.iatistandard.org/api/1/about/dataset/ausgov-af

{
  "dataset": "ausgov-af",
  "last_modified": "2020-04-11T02:44:58.659826",
  "num_resources": 1,
  "resources": [
    {
      "last_fetch": "2020-07-16T17:55:03.624162",
      "last_parsed": "2020-07-17T05:10:22.817871",
      "last_status_code": 200,
      "last_successful_fetch": "2020-07-16T17:55:03.624352",
      "num_of_activities": 177,
      "url": "https://www.dfat.gov.au/sites/default/files/Australian_Aid_Country_File_Afghanistan.xml"
    }
  ]
}

andylolz commented 3 years ago

Does the publisher know?

I contacted them today.

notshi commented 3 years ago

Also raised here https://github.com/zimmerman-team/iati.cloud/issues/2470

andylolz commented 3 years ago

Thanks, @notshi! I’ve added a comment there.

stale[bot] commented 3 years ago

This issue has been automatically marked as "awaiting update". If you’ve checked and the issue is still applicable, please add a message to that effect.

andylolz commented 3 years ago

It’s definitely still applicable.

stale[bot] commented 3 years ago

Hello! There has been no activity on this issue in the last 30 days. I wonder if it has now been resolved?!

If you’re reading this, would you mind checking to see if the issue is still applicable?

Thank you!

andylolz commented 3 years ago

> would you mind checking to see if the issue is still applicable?

It is, yes. I suspect some services have been whitelisted on Australia DFAT servers, but others have not.

The issue was raised on the registry (https://github.com/IATI/ckanext-iati/issues/323), but after changing the user-agent header (https://github.com/IATI/ckanext-iati/commit/d195e2deed4d6ef77487b9ceb7fa0b2b75bbce49), it looks like the registry archiver is successfully fetching these datasets.

Also, the new datastore now appears to successfully fetch these datasets: https://iatidatastore.iatistandard.org/search/activity?q=reporting_org_ref:(AU-5)&wt=json&rows=50

{
  …
  "response": {
    "numFound": 13632,
    "start": 0,
    "docs": […]
  }
}

However, d-portal is still having trouble downloading Australia DFAT datasets: http://d-portal.org/ctrack.html#view=dash_sluglog&slug=ausgov-af [Screenshot 2021-05-01 at 12:15:28]

http://d-portal.org/ctrack.html#view=search&reporting_ref=AU-5 [Screenshot 2021-05-01 at 13:00:04]

The old datastore no longer exists, but datastore classic has never managed to fetch Australia DFAT data. E.g.: https://datastore.codeforiati.org/api/1/about/dataset/ausgov-af

{
    "dataset": "ausgov-af",
    "last_modified": "2021-03-31T23:15:12.448867",
    "num_resources": 1,
    "resources": [
        {
            "last_fetch": "2021-03-02T00:40:31",
            "last_parsed": null,
            "last_status_code": 404,
            "last_successful_fetch": null,
            "num_of_activities": 0,
            "url": "https://www.dfat.gov.au/sites/default/files/Australian_Aid_Country_File_Afghanistan.xml"
        }
    ]
}

(NB the last_status_code is wrong here – that’s a known bug.)
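
Since `last_status_code` can't be trusted here, a check along these lines could rely on `last_successful_fetch` instead. This is only a sketch, assuming the "about" response keeps the shape shown above; the function name `never_fetched` is hypothetical and the sample is a trimmed copy of that response.

```python
import json


def never_fetched(about: dict) -> list:
    """Return URLs of resources with no successful fetch on record.

    Deliberately ignores last_status_code and looks only at
    last_successful_fetch.
    """
    return [
        resource["url"]
        for resource in about.get("resources", [])
        if resource.get("last_successful_fetch") is None
    ]


# Trimmed copy of the datastore classic response above.
sample = json.loads("""
{
  "dataset": "ausgov-af",
  "resources": [
    {
      "last_fetch": "2021-03-02T00:40:31",
      "last_status_code": 404,
      "last_successful_fetch": null,
      "url": "https://www.dfat.gov.au/sites/default/files/Australian_Aid_Country_File_Afghanistan.xml"
    }
  ]
}
""")

for url in never_fetched(sample):
    print("never fetched:", url)
```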

notshi commented 3 years ago

Thanks for the update, @andylolz

d-portal downloads the files randomly and in no particular order so assuming we are allowed to at least get one file a day, we might eventually get all of them. However, as you've mentioned, we are probably not whitelisted.

Via our database logs, the last successful fetch was 15 Dec 2020 for 13,650 activities.

This was the same day the d-portal server died due to a disk error which took a week to be replaced and installed, and our ip address would have also changed as a result. When the hard disk on the server failed, we lost any data that was currently in a cached state. This would explain why there is no 'last successfully downloaded' date for AU-5 datasets on Dash.

The Dashboard is also having trouble downloading the files. http://dashboard.iatistandard.org/publisher/ausgov.html [Screenshot 2021-05-01 at 14:24:12]

Looks like it's been happening since 2020-07-13 19:36:57 +0100.

Oddly, there was a blip on 2021-04-21 14:52:12 +0100 where the Dashboard was able to retrieve 13,927 activities!

matmaxgeds commented 3 years ago

I am not sure it feels 'right', but where the DSv2 has access, could we set it as a backup source - I think they have an internal url listed for each file: https://iatidatastore.iatistandard.org/api/datasets/?publisher_identifier=AU-5&format=json but not sure if it is actually accessible

andylolz commented 3 years ago

> where the DSv2 has access, could we set it as a backup source - I think they have an internal url listed for each file

Oooh – this is useful to know about, thanks! Interesting – those internal_urls appear to have saved the HTML of a 404 page, that looks like this. That possibly means DSv2 is also struggling to import Australia DFAT data.

andylolz commented 3 years ago

Thanks for this, @notshi!

> The Dashboard is also having trouble downloading the files. http://dashboard.iatistandard.org/publisher/ausgov.html

I did not think to check the dashboard! The codeforIATI version doesn’t list ausgov as a publisher at all: https://dashboard.codeforiati.org/publisher/ausgov.html

Presumably that’s because it has never seen any ausgov data.

andylolz commented 3 years ago

I moved iati-data-dump to github actions today, and it successfully managed to download ausgov data.

This data should start to bubble up through codeforIATI services, e.g. to datastore classic and the dashboard.

notshi commented 3 years ago

We are using User-Agent Mozilla/5.0 and that looks to be blocked by AU-5. This is a problem as we started using this because other servers block curl.

So now we use the default curl ua as the backup. And this seems to have solved issues with many servers that were previously giving us errors.
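
The fallback described above could look roughly like this. It is a sketch under stated assumptions, not d-portal's actual code: the exact curl version string, the function names `http_get` and `fetch_with_fallback`, and the injectable `get` parameter are all illustrative.

```python
import urllib.error
import urllib.request

# Assumed values: the thread only mentions "Mozilla/5.0" and "the default
# curl ua"; the curl version string below is illustrative.
USER_AGENTS = ["Mozilla/5.0", "curl/7.68.0"]


def http_get(url, user_agent, timeout=30):
    """Fetch url with the given User-Agent; return (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Blocked requests (e.g. 403) arrive here rather than as a response.
        return err.code, b""


def fetch_with_fallback(url, user_agents=USER_AGENTS, get=http_get):
    """Try each User-Agent in order and stop at the first 200.

    Returns (user_agent, status, body) for the first success, or for the
    last attempt if every User-Agent was refused.
    """
    result = None
    for user_agent in user_agents:
        status, body = get(url, user_agent)
        result = (user_agent, status, body)
        if status == 200:
            break
    return result
```

Passing the fetcher in as `get` keeps the retry logic testable without touching the network. As the later comments note, a user-agent fallback alone may not be enough if the blocking also depends on IP address.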

We now have over a million activities so it looks like this might have found us 40,000 activities.

http://d-portal.org/ctrack.html#view=search&reporting_ref=AU-5 [Screenshot 2021-05-07 at 12:49:46]

http://d-portal.org/ctrack.html#view=dash_sluglog&slug=ausgov-af [Screenshot 2021-05-07 at 12:49:04]

By the way, there are still issues with some datasets by AU-5.

[Screenshot 2021-05-07 at 13:04:41]

andylolz commented 3 years ago

> So now we use the default curl ua as the backup. And this seems to have solved issues with many servers that were previously giving us errors.

I’m still unsure how this blocking works… Perhaps a combination of user-agent and IP address? (I didn’t change user-agent but did change IP address, and that also did the trick).

> By the way, there are still issues with some datasets by AU-5.

That’s true… But those are 404 errors, so I think we should consider them separately.

andylolz commented 3 years ago

I’m not really sure what to do with this ticket? I’m not convinced it’s fixed, but it sounds like we’re not experiencing the problem right now…

notshi commented 3 years ago

> That’s true… But those are 404 errors, so I think we should consider them separately.

Agreed!

I think the ticket should still be opened due to ongoing issues and also because the Dashboard is still getting errors accessing the data. Should we raise it with the Dashboard maintainers?

andylolz commented 3 years ago

> I think the ticket should still be opened due to ongoing issues

Cool okay, agreed.

> also because the Dashboard is still getting errors accessing the data

Oh, good point – you’re right.

notshi commented 3 years ago

Ok so this might trickle down to who/what tools we are adding data issues for.

Might be worth updating the readme?

So for example, this repo tracks issues that affect externally maintained tools using IATI, etc. This should include the Registry.

I mean, I'd like to consider this issue closed but AU-5 might hiccup tomorrow or in a week's time. Though we could re-open or create a new issue if that happens.

notshi commented 3 years ago

By the way, feel free to ignore suggestions if they seem pedantic! It's mostly for my train of thought and process.

andylolz commented 3 years ago

No problem at all!

Let’s make a new meta ticket to decide what should/shouldn’t be recorded in this repo.

siemvaessen commented 3 years ago

Looks available? IATI.cloud had some issues on this as well, but https://iatidatastore.iatistandard.org/search/activity?q=reporting_org_ref:(AU-5)&wt=json&rows=13632 seems to show all AU-5 data?

andylolz commented 3 years ago

> Looks available? IATI.cloud had some issues on this as well, but https://iatidatastore.iatistandard.org/search/activity?q=reporting_org_ref:(AU-5)&wt=json&rows=13632 seems to show all AU-5 data?

The problem is, the DFAT server is a bit over-zealous in its blocking. So while lots of requests succeed (that link works for me, too) a lot of requests are getting blocked. Limiting availability is probably bad practice when serving open data.

For instance, the IATI dashboard still appears unable to access DFAT data: [Screenshot 2021-06-01 at 18:12:15]

While the problem might not be affecting our services, we have evidence suggesting the problem does still exist. So I’d rather not close this yet.

siemvaessen commented 3 years ago

Yeah, we're aware of this issue. We also decided to not spend any more time on this as the data owner is seemingly very reluctant to make any changes on their end.

codeforIATIbot commented 3 years ago

Hello! There has been no activity on this issue in the last 30 days. I wonder if it has now been resolved?

If you’re reading this, would you mind checking to see if the issue is still applicable?

Thank you!

notshi commented 3 years ago

The Dashboard is still unable to access DFAT (AU-5) data / servers.

codeforIATIbot commented 3 years ago

Hello! There has been no activity on this issue in the last 30 days. I wonder if it has now been resolved?

If you’re reading this, would you mind checking to see if the issue is still applicable?

Thank you!

notshi commented 3 years ago

The previous issue still applies, and I doubt this will change as the publisher seems unresponsive to correspondence.

In such cases, should there be a new label where a bug is defined as "won't fix" and no reminder set? @andylolz

andylolz commented 3 years ago

> In such cases, should there be a new label where a bug is defined as "won't fix" and no reminder set? @andylolz

^^ Yeah, there’s an evergreen label for this purpose.