Closed jonathantullett closed 8 years ago
Use the Search API to access Crawlbot produced datasets and get back actual usable objects as defined by the client.
$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();
foreach ($result as $article) {
    echo $article->getTitle();
}
The API itself always returns the whole dataset when being asked for data, so there's no way to "stream" only new results, no, but I am planning to add it to the client in the foreseeable future (see #5).
I looked at the Search API. According to the Diffbot docs, searching with an empty string should return all results: "Leave blank to return all objects in the collection(s)." (under query operators on https://www.diffbot.com/dev/docs/search/). However, I'm not seeing any data being returned (I've hit the URL directly and don't see the data there either), so perhaps the documentation is out of sync on their end?
If this is correct, is loading/processing the raw json the only way of dealing with the whole dataset?
It would appear this has changed, indeed. Let me check and get back to you.
Confirmed, and the docs have been updated. An empty search query will no longer return the full set.
However, seeing as the collection will contain entities of various types, you won't be able to use them properly in a loop anyway. Consider:
foreach ($result as $article) {
    echo $article->getTitle();
}
This would fail if some of the entities were custom, or products, or whatnot. Ergo, it would likely be best if you queried with at least a type (e.g. type:article). If all the entities in the result set are of the same type, even better - you get all your entities, AND you're sure they're exactly what you expect. Would this be acceptable?
Yes, querying by type makes sense, and as you say, ensures the results are an object type you're expecting. Is that a change you need to make in the codebase or something that's already supported?
Already in there. Just pass "type:x" as the query, where x is your desired type.
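For illustration, a type-constrained query would look like this (a sketch reusing the `$diffbot` instance from the earlier example; `type:product` is just a sample type):

```php
// Restrict the search to a single entity type so every result
// is the same class and its getters are safe to call.
$search = $diffbot->search('type:product');

foreach ($search->call() as $product) {
    echo $product->getTitle(), "\n";
}
```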
Working perfectly - thanks!
From reading the docs, it looks like loading the JSON via the downloadUrl() method on the Crawl job is the only way to do it. However, as that won't give any getters/setters/objects (because it's processing the raw JSON data), it smells... wrong.
Is there a better way of doing this?
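For context, what I mean by processing the raw dump is something like this (a sketch only; I'm assuming downloadUrl() returns the URL of the JSON dump, and the rest is plain PHP with no typed entities):

```php
// Fetch the crawl job's raw JSON dump and decode it into plain arrays.
$json = file_get_contents($job->downloadUrl('json'));
$entities = json_decode($json, true);

foreach ($entities as $entity) {
    // No typed getters here - just associative array access.
    echo isset($entity['title']) ? $entity['title'] : '(untitled)', "\n";
}
```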
Related: as the crawl job updates when new pages are discovered, is there a way of downloading only the new data - i.e. results since the last query, so there's no reprocessing - or is that left as an exercise for the reader?
Thanks!