Closed jonathantullett closed 8 years ago
Use the Search API to access Crawlbot produced datasets and get back actual usable objects as defined by the client.
$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();
foreach ($result as $article) {
    echo $article->getTitle();
}
The API itself always returns the whole dataset when being asked for data, so there's no way to "stream" only new results, no, but I am planning to add it to the client in the foreseeable future (see #5).
I looked at the Search API. According to the Diffbot docs, searching with an empty string should return all results: "Leave blank to return all objects in the collection(s)." (under query operators on https://www.diffbot.com/dev/docs/search/). However, I'm not seeing any data being returned (I've hit the URL directly and don't see the data there either), so perhaps the documentation is out of sync on their end?
If this is correct, is loading/processing the raw json the only way of dealing with the whole dataset?
It would appear this has changed, indeed. Let me check and get back to you.
Confirmed, and the docs have been updated. An empty search query will no longer return the full set.
However, seeing as the collection will contain entities of various types, you won't be able to use them properly in a loop anyway. Consider:
foreach ($result as $article) {
    echo $article->getTitle();
}
This would fail if some of the entities were custom, or products, or whatnot. Ergo, it would likely be best if you queried with at least a type (e.g. type:article). If all the entities in the result set are of the same type, even better - you get all your entities, AND you're sure they're exactly what you expect. Would this be acceptable?
Yes, querying by type makes sense, and as you say, ensures the results are an object type you're expecting. Is that a change you need to make in the codebase or something that's already supported?
Already in there. Just pass "type:x" as the query, where x is your desired type.
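For illustration, a type-constrained query would look like this (a sketch reusing the `$diffbot` instance from the earlier example; `type:product` is just a sample type):

```php
// Restrict the search to a single entity type so every result
// is the same class and its getters are safe to call.
$search = $diffbot->search('type:product');

foreach ($search->call() as $product) {
    echo $product->getTitle(), "\n";
}
```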
Working perfectly - thanks!
From reading the docs, it looks like loading the JSON via the downloadUrl() method on the Crawl job is the only way to do it. However, as that won't give any getters/setters/objects (because it's processing the raw JSON data), it smells... wrong.
Is there a better way of doing this?
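For context, what I mean by processing the raw dump is something like this (a sketch only; I'm assuming downloadUrl() returns the URL of the JSON dump, and the rest is plain PHP with no typed entities):

```php
// Fetch the crawl job's raw JSON dump and decode it into plain arrays.
$json = file_get_contents($job->downloadUrl('json'));
$entities = json_decode($json, true);

foreach ($entities as $entity) {
    // No typed getters here - just associative array access.
    echo isset($entity['title']) ? $entity['title'] : '(untitled)', "\n";
}
```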
Related: as the crawl job updates when new pages are discovered, is there a way of downloading only the new data - i.e. results since the last query, so there's no reprocessing - or is that left as an exercise for the reader?
Thanks!