Closed andylolz closed 7 years ago
Hmm this still looks broken…
Looks like running `iati crawler status` on the datastore server might provide some clues?
https://github.com/IATI/IATI-Datastore/blob/5b871aa3/iati_datastore/iatilib/crawler.py#L422-L423
@andylolz wrote:
But checking `last_fetch` date on all datasets
Here’s the code (and its output) I used to check that: https://gist.github.com/andylolz/d991656c853d012f85b1468354c36a8e
I’ve just tried debugging this, and it looks like the first URL that the crawler hits is 404ing: https://www.iatiregistry.org//api/action/package_list
Removing the double-slash fixes the problem: https://www.iatiregistry.org/api/action/package_list
I guess this issue was caused by https://github.com/ViderumGlobal/ckanext-iati/issues/126 … Looks like the registry is now on version 2.6.2, but on 8 July (the date of the last successful crawl) it was 2.5.2 (you can tell by viewing source on that page and checking the content of `meta name="generator"`).
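For anyone who wants to repeat that version check without viewing source by hand, here is a minimal sketch using only the Python standard library. The `GeneratorFinder` class, the `find_generator` helper and the sample markup are illustrative assumptions, not code from the registry or the datastore.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Remembers the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching meta tag is kept.
        if tag != "meta" or self.generator is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generator = attributes.get("content")


def find_generator(html):
    """Return the generator string from an HTML page, or None if absent."""
    finder = GeneratorFinder()
    finder.feed(html)
    return finder.generator


# Illustrative markup only; this is not a copy of the registry's page.
sample = '<html><head><meta name="generator" content="ckan 2.6.2" /></head></html>'
print(find_generator(sample))
```

Feeding it the HTML of the registry home page (fetched however you like) should give the same string you would see in the page source.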
I’ve removed the trailing slash from the CKAN API base URL in #272, which looks like it fixes the 404ing and gets the crawler running again.
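A more defensive option would be to normalise the join itself, so a trailing slash on the base URL can't produce `//` again. This is only a sketch; `join_url` is a hypothetical helper and not necessarily what #272 does.

```python
def join_url(base, path):
    """Join a base URL and a path with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")


# Both spellings of the base URL give the same result.
print(join_url("https://www.iatiregistry.org/", "/api/action/package_list"))
print(join_url("https://www.iatiregistry.org", "api/action/package_list"))
```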
The crawler is working again 🎉 🎉 🎉
Thanks @andylolz. I was looking for FAO data, but it's not there. Maybe it will kick in with the new crawl...
@stevieflow wrote:
I was looking for FAO data, but it's not there. Maybe it will kick in with the new crawl...
I don’t think so, no :( the crawler ran last night, so if something you were expecting isn’t there, then it probably isn’t going to appear.
Out of interest, what’s missing that you were expecting to see, @stevieflow? The generated datetime for the FAO Activities is currently 2017-06-26, and the datastore has parsed the data more recently than that… So that looks correct to me.
Crawler appears to be broken again.
I've restarted the server and it seems to be updating again - by the looks of it, the crawler is set to run on reboot, and if it stops running then that's it.
Have also added an action to create a test that automatically checks that the Datastore has updated its contained data recently (like the equivalent test for the Dashboard).
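As a rough idea of what such a check could look like, here is a sketch that takes the `last_fetch` values for all datasets and decides whether the newest one is recent enough. The function names, the two-day threshold and the ISO 8601 input format are assumptions for illustration; how the values are collected from the Datastore is left out.

```python
from datetime import datetime, timedelta


def most_recent_fetch(last_fetch_values):
    """Return the newest timestamp from ISO 8601 strings, ignoring empty values."""
    parsed = [datetime.fromisoformat(value) for value in last_fetch_values if value]
    return max(parsed) if parsed else None


def is_fresh(last_fetch_values, now, max_age=timedelta(days=2)):
    """True if at least one dataset was fetched within max_age of now."""
    latest = most_recent_fetch(last_fetch_values)
    return latest is not None and now - latest <= max_age


# Made-up values: nothing fetched since 8 July, checked on 20 July.
values = ["2017-07-08T02:10:00", None, "2017-07-01T00:00:00"]
print(is_fresh(values, now=datetime(2017, 7, 20)))
```

The threshold is deliberately looser than 24 hours so that a single slow crawl doesn't trip the check.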
I was thinking about this last night and remembered that issue #230 exists.
It might be that the worker is only set to run at reboot because it is designed to be constantly running, even after it finishes parsing queued datasets.
That’s right – the worker (`iati queue background`) should constantly run (and presumably should just restart on reboot). The crawler (`iati crawl update`) should run every 24 hours (and presumably should run on a cron).
It looooooks like everything is working normally now… When I run locally, the `last_fetched` data looks reasonably similar.
Hi everybody!
So after some digging and things we've determined that `last_fetch` and `last_successful_fetch` don't update for every dataset every time the crawler runs. They only update based on changes to the IATI registry. If no changes have been detected by the registry, the crawler will skip these datasets. So excluding problems with the registry and the Datastore going down, the crawler should run, as scheduled, every day.
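To make the skipping behaviour concrete, here is a simplified sketch of the idea, not the crawler's actual code. It assumes the registry exposes some per-dataset change marker (for example a revision identifier) and that the Datastore remembers the last marker it saw; `datasets_to_fetch` and both dictionaries are hypothetical.

```python
def datasets_to_fetch(registry_markers, stored_markers):
    """Return names of datasets whose registry marker differs from the stored one.

    Datasets the Datastore has never seen have no stored marker, so they are
    always selected.
    """
    return sorted(
        name
        for name, marker in registry_markers.items()
        if stored_markers.get(name) != marker
    )


# Made-up dataset names and markers for illustration.
registry = {"dataset-a": "rev-3", "dataset-b": "rev-1", "dataset-c": "rev-1"}
stored = {"dataset-a": "rev-2", "dataset-b": "rev-1"}
print(datasets_to_fetch(registry, stored))
```

Under this model a dataset whose marker never changes is never refetched, so its `last_fetch` stays old even though the crawler itself ran.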
So after some digging and things we've determined that `last_fetch` and `last_successful_fetch` don't update for every dataset every time the crawler runs.
That is true, although it’s not how it’s documented to work: http://datastore.iatistandard.org/docs/api/error/#general-data
…so there’s an issue there.
This issue was about the fact that none of the `last_fetch` dates were recent, which strongly suggested that something was awry! The bug in question was resolved in #272, and then some mysterious problem was resolved by restarting.
I’m still not really confident that this does, in fact, run nightly as advertised, which was why I left this ticket open. But as a datastore user, it’s not terribly easy to ascertain whether everything is fine. It would be wonderful if http://datastore.iatistandard.org/api/1/about provided more reliable data – rather than a hardcoded `ok: True` and `status: healthy`.
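One possible shape for a more informative response, as a sketch only: derive `ok` and `status` from when the crawler last finished instead of hardcoding them. The `about_payload` function, the `last_crawl_finished` field, the `stale` status value and the 36-hour threshold are all assumptions, not part of the existing API.

```python
from datetime import datetime, timedelta


def about_payload(last_crawl_finished, now, max_age=timedelta(hours=36)):
    """Build a status dict from the time the last crawl finished (or None)."""
    healthy = last_crawl_finished is not None and now - last_crawl_finished <= max_age
    return {
        "ok": healthy,
        "status": "healthy" if healthy else "stale",
        # Exposing the timestamp lets users verify the nightly run themselves.
        "last_crawl_finished": (
            last_crawl_finished.isoformat() if last_crawl_finished else None
        ),
    }


# Made-up timestamps: last crawl finished 12 days before the check.
print(about_payload(datetime(2017, 7, 8, 2, 0), now=datetime(2017, 7, 20, 9, 0)))
```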
At that time there were some problems with the registry that would have made the crawler think no updates had been made for a while. The crawler does run nightly regardless. The problem is with the Registry, not the Datastore.
Other issues found in the process of looking at this one (such as docs) should have their own issue.
At that time there were some problems with the registry that would have made the crawler think no updates had been made for a while.
Good to have this update! I don’t think I had any idea about that.
The crawler does run nightly regardless.
My point here is: I have no way of verifying that. So it’s not that I don’t trust you… But a route that tells me what happened and when would help set my mind at ease. The publicly accessible crawler monitoring data is not up to scratch. I’ve created an issue (#279) about that.
Other issues found in the process of looking at this one (such as docs) should have their own issue.
In fact, I was wrong about http://datastore.iatistandard.org/docs/api/error/#general-data – the documentation is accurate on this.
Why was this issue closed? The Datastore does not appear to be running nightly. See #268
Why was this issue closed?
As noted by @allthatilk, the Datastore is working as expected.
As noted by @allthatilk, the Datastore is working as expected.
@allthatilk’s comment was:
So excluding problems with the registry and the Datastore going down, the crawler should run, as scheduled, every day.
But this certainly doesn’t mean the Datastore is working as expected. Indeed, #268 is a clear example of the Datastore not working as expected.
Documentation suggests the datastore crawler runs daily… But checking `last_fetch` date on all datasets, the most recent update was over a week ago.