IATI / IATI-Datastore

An open-source datastore for IATI data with RESTful web API providing XML, JSON, CSV plus ETL tools
http://datastore.iatistandard.org/

Last fetch date was 7 days ago… How often does the crawler run? #271

Closed andylolz closed 7 years ago

andylolz commented 7 years ago

Documentation suggests the datastore crawler runs daily… But checking last_fetch date on all datasets, the most recent update was about a week ago.

andylolz commented 7 years ago

Hmm this still looks broken…

Looks like running iati crawler status on the datastore server might provide some clues? https://github.com/IATI/IATI-Datastore/blob/5b871aa3/iati_datastore/iatilib/crawler.py#L422-L423

andylolz commented 7 years ago

@andylolz wrote:

But checking last_fetch date on all datasets

Here’s the code (and its output) I used to check that: https://gist.github.com/andylolz/d991656c853d012f85b1468354c36a8e
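For readers who can't open the gist, the kind of check described here could be sketched roughly as below. This is not the gist's code: it assumes each dataset record is a dict carrying a `last_fetch` timestamp in ISO 8601 form, and the sample records and the "now" value are made up for illustration.

```python
from datetime import datetime

def most_recent_fetch(datasets):
    """Return the latest last_fetch across all datasets, or None if none is set."""
    stamps = [
        datetime.fromisoformat(d["last_fetch"])
        for d in datasets
        if d.get("last_fetch")  # skip datasets that were never fetched
    ]
    return max(stamps) if stamps else None

# Hypothetical records; only the field name last_fetch comes from this thread.
records = [
    {"name": "example-a", "last_fetch": "2017-07-08T02:13:00"},
    {"name": "example-b", "last_fetch": "2017-07-05T02:10:41"},
    {"name": "example-c", "last_fetch": None},
]

latest = most_recent_fetch(records)
age = datetime(2017, 7, 15, 9, 0, 0) - latest  # fixed "now" so the result is repeatable
print(latest.isoformat(), "was", age.days, "days ago")
```

If the crawler really runs daily, the age of the newest `last_fetch` should normally be around one day, not seven.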

andylolz commented 7 years ago

I’ve just tried debugging this, and it looks like the first URL that the crawler hits is 404ing: https://www.iatiregistry.org//api/action/package_list

Removing the double-slash fixes the problem: https://www.iatiregistry.org/api/action/package_list

I guess this issue was caused by https://github.com/ViderumGlobal/ckanext-iati/issues/126 … Looks like the registry is now on version 2.6.2, but on 8 July (the date of the last successful crawl) it was 2.5.2 (you can tell by viewing source on that page and checking the content of meta name="generator").
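The "view source" check can also be scripted. A minimal sketch using only the standard library follows; the HTML string is an illustrative sample, not a capture of the actual registry page, and `GeneratorFinder` is a name invented for this example.

```python
from html.parser import HTMLParser

class GeneratorFinder(HTMLParser):
    """Record the content attribute of <meta name="generator">, if present."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "generator":
            self.generator = attributes.get("content")

# Illustrative sample markup (assumption), standing in for the fetched page source.
sample_html = '<html><head><meta name="generator" content="ckan 2.6.2" /></head></html>'

finder = GeneratorFinder()
finder.feed(sample_html)
print(finder.generator)
```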

I’ve removed the trailing slash from the CKAN API base URL in #272, which looks like it fixes the 404ing and gets the crawler running again.
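To illustrate how the double slash arises, here is a small sketch. It is not the change made in #272 (which just removes the trailing slash from the configured base URL); `join_url` is a hypothetical helper showing a more defensive way to build the URL regardless of how the base is configured.

```python
def join_url(base, path):
    """Join base and path with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")

base_with_slash = "https://www.iatiregistry.org/"
endpoint = "/api/action/package_list"

# Plain concatenation keeps both slashes, producing the URL that 404s.
naive = base_with_slash + endpoint
print(naive)

# Normalising both sides gives the working URL either way.
print(join_url(base_with_slash, endpoint))
print(join_url("https://www.iatiregistry.org", endpoint))
```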

andylolz commented 7 years ago

The crawler is working again 🎉 🎉 🎉

stevieflow commented 7 years ago

Thanks @andylolz. I was looking for FAO data, but it isn't there. Maybe it will kick in with the new crawl...

andylolz commented 7 years ago

@stevieflow wrote:

I was looking for FAO data, but it isn't there. Maybe it will kick in with the new crawl...

I don’t think so, no :( the crawler ran last night, so if something you were expecting isn’t there, then it probably isn’t going to appear.

Out of interest, what’s missing that you were expecting to see, @stevieflow? The generated datetime for the FAO Activities is currently 2017-06-26, and the datastore has parsed the data more recently than that… So that looks correct to me.

andylolz commented 7 years ago

Crawler appears to be broken again.

hayfield commented 7 years ago

I've restarted the server and it seems to be updating again - by the looks of it, the crawler is set to run on reboot, and if it stops running then that's it.

Have also added an action to create a test that automatically checks that the Datastore has updated its contained data recently (like the equivalent test for the Dashboard).

dalepotter commented 7 years ago

I was thinking about this last night and remembered that issue #230 exists.

It might be that the worker is only set to run at reboot because it is designed to be constantly running, even after it finishes parsing queued datasets.

andylolz commented 7 years ago

That’s right – the worker (iati queue background) should constantly run (and presumably should just restart on reboot). The crawler (iati crawl update) should run every 24 hours (and presumably should run on a cron).

It looooooks like everything is working normally now… When I run locally, the last_fetch dates look reasonably similar.

allthatilk commented 7 years ago

Hi everybody!

So after some digging and things we've determined that last_fetch and last_successful_fetch don't update for every dataset every time the crawler runs. They only update when a dataset has changed on the IATI registry. If the registry reports no changes for a dataset, the crawler skips it. So excluding problems with the registry and the Datastore going down, the crawler should run, as scheduled, every day.
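The skip behaviour described above might look something like the following sketch. It is an assumption about the general shape of the logic, not the crawler's actual code: `needs_fetch` and both timestamp arguments are hypothetical names, standing for "when the registry says the dataset last changed" and "the registry timestamp recorded at the previous fetch".

```python
from datetime import datetime

def needs_fetch(registry_modified, last_seen_modified):
    """Fetch only if the registry reports a change since the previous fetch."""
    if last_seen_modified is None:
        return True  # never fetched before, so always fetch
    return registry_modified > last_seen_modified

unchanged = needs_fetch(datetime(2017, 6, 26), datetime(2017, 6, 26))
changed = needs_fetch(datetime(2017, 7, 10), datetime(2017, 6, 26))
print(unchanged, changed)
```

Under logic like this, a dataset the registry reports as unchanged keeps its old `last_fetch` even when the crawler itself ran on schedule.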

andylolz commented 7 years ago

So after some digging and things we've determined that last_fetch and last_successful_fetch don't update for every dataset every time the crawler runs.

That is true, although it’s not how it’s documented to work: http://datastore.iatistandard.org/docs/api/error/#general-data

…so there’s an issue there.

This issue was about the fact that none of the last_fetch dates were recent, which strongly suggested that something was awry! The bug in question was resolved in #272, and then some mysterious problem was resolved by restarting.

I’m still not really confident that this does, in fact, run nightly as advertised, which was why I left this ticket open. But as a datastore user, it’s not terribly easy to ascertain whether everything is fine. It would be wonderful if http://datastore.iatistandard.org/api/1/about provided more reliable data – rather than a hardcoded ok: True and status: healthy:

https://github.com/IATI/IATI-Datastore/blob/852ec9a33c0f44f674fb6cdcd6cf7c9fe4ac9e8e/iati_datastore/iatilib/frontend/api1.py#L34-L39
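As a rough idea of what a less hardcoded response could contain, here is a sketch that derives the status from the time the last crawl finished. Everything apart from the `ok` and `status` keys is an assumption: the function name, the `last_crawl_finished` field, the `stale` value and the 36-hour threshold are all invented for illustration.

```python
from datetime import datetime, timedelta

def about_payload(last_crawl_finished, now, max_age=timedelta(hours=36)):
    """Build a status payload from the last crawl time instead of constants."""
    healthy = (
        last_crawl_finished is not None
        and now - last_crawl_finished <= max_age
    )
    return {
        "ok": healthy,
        "status": "healthy" if healthy else "stale",
        "last_crawl_finished": (
            last_crawl_finished.isoformat() if last_crawl_finished else None
        ),
    }

now = datetime(2017, 7, 15, 9, 0, 0)
recent = about_payload(datetime(2017, 7, 15, 2, 0, 0), now)
week_old = about_payload(datetime(2017, 7, 8, 2, 0, 0), now)
print(recent)
print(week_old)
```

A threshold somewhat longer than 24 hours leaves room for a nightly crawl that starts or finishes a little late.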

allthatilk commented 7 years ago

At that time there were some problems with the registry that would have made the crawler think no updates had been made for a while. The crawler does run nightly regardless. The problem is with the Registry, not the Datastore.

Other issues found in the process of looking at this one (such as docs) should have their own issue.

andylolz commented 7 years ago

At that time there were some problems with the registry that would have made the crawler think no updates had been made for a while.

Good to have this update! I don’t think I had any idea about that.

The crawler does run nightly regardless.

My point here is: I have no way of verifying that. So it’s not that I don’t trust you… But a route that tells me what happened and when would help set my mind at ease. The publicly accessible crawler monitoring data is not up to scratch. I’ve created an issue (#279) about that.

Other issues found in the process of looking at this one (such as docs) should have their own issue.

In fact, I was wrong about http://datastore.iatistandard.org/docs/api/error/#general-data – the documentation is accurate on this.

markbrough commented 7 years ago

Why was this issue closed? The Datastore does not appear to be running nightly. See #268

hayfield commented 7 years ago

Why was this issue closed?

As noted by @allthatilk, the Datastore is working as expected.

andylolz commented 7 years ago

As noted by @allthatilk, the Datastore is working as expected.

@allthatilk’s comment was:

So excluding problems with the registry and the Datastore going down, the crawler should run, as scheduled, every day.

But this certainly doesn’t mean the Datastore is working as expected. Indeed, #268 is a clear example of the Datastore not working as expected.