Swader / diffbot-php-client

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
MIT License

Custom Pagination #26

Open Swader opened 8 years ago

Swader commented 8 years ago

The pagination side of Diffbot is unreliable. It often fails to recognize multi-page articles and does not merge them. What's more, it tops out at 20 pages, so anything longer gets ignored.

The feature suggestion for the client is as follows:

Add a new method to the Article API: paginateBy. This method takes 2 arguments: $identifier and $maxPages. The former is a way to identify the nextPage link element on the page. This element would be auto-processed to find all the next pages programmatically. The latter is the maximum number of pages to concatenate.

This method would, in order:

  1. Make an Article API request to the original URL.
  2. Find the nextPage element and process it to derive the URL pattern to which incrementing numbers are attached, thus generating the URLs of the next pages.
  3. Make an additional Article API request to each page, up to $maxPages pages.
  4. Concatenate the HTML content of all pages.
  5. Send the merged HTML content as a POST request to the Article API, for a final analysis of the entire post.
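
Step 2 could be sketched roughly as follows. This is only an illustration, not part of the client: the function name `generatePageUrls` is hypothetical, and it assumes the page number is the last run of digits in the nextPage link's href, and that the original URL counts as page 1 (so at most `$maxPages - 1` extra URLs are produced).

```php
<?php

/**
 * Hypothetical helper: build follow-up page URLs from the href of the
 * "next page" link by incrementing the last number found in it.
 *
 * Assumption: the original URL is page 1, so at most $maxPages - 1
 * additional URLs are returned.
 */
function generatePageUrls(string $nextPageHref, int $maxPages): array
{
    // Split the href into: prefix, last run of digits, non-digit suffix.
    if ($maxPages < 2 || !preg_match('/^(.*\D)?(\d+)(\D*)$/', $nextPageHref, $m)) {
        return []; // no extra pages wanted, or no number to increment
    }

    [, $prefix, $number, $suffix] = $m;
    $first = (int) $number;

    $urls = [];
    for ($i = 0; $i < $maxPages - 1; $i++) {
        $urls[] = $prefix . ($first + $i) . $suffix;
    }

    return $urls;
}

// Usage: the nextPage link points at page 2, and we want 4 pages in total.
print_r(generatePageUrls('http://example.com/post?page=2', 4));
```

A pattern like this cannot know how many pages really exist, so step 3 would still need to stop early once a generated URL returns an error or no article content.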

Alternatively, in order to save Article API requests and use up only one, the client could just fetch the raw HTML of all the pages with Guzzle, extract the content HTML, merge that and send it as POST. This, however, is less reliable, as Diffbot is much better at figuring out what is content on the page, and what isn't (headers, ads, comments, etc.).
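
A minimal sketch of the extract-and-merge part of this alternative, using PHP's bundled DOM extension. The function names `extractContentHtml` and `mergePages` are hypothetical, and it assumes the caller can supply an XPath query that selects the content container; the raw HTML is passed in as strings, so fetching it (e.g. with Guzzle) is left out.

```php
<?php

/**
 * Hypothetical helper: return the outer HTML of every node matching
 * $xpathQuery in $html, or an empty string if nothing matches.
 */
function extractContentHtml(string $html, string $xpathQuery): string
{
    if ($html === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely well-formed; collect warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpathQuery);
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }

    $out = '';
    foreach ($nodes as $node) {
        $out .= $doc->saveHTML($node);
    }

    return $out;
}

/**
 * Hypothetical helper: extract the content of each page, in order,
 * and join the fragments into one HTML string ready to be POSTed.
 */
function mergePages(array $pagesHtml, string $xpathQuery): string
{
    $parts = [];
    foreach ($pagesHtml as $html) {
        $content = extractContentHtml($html, $xpathQuery);
        if ($content !== '') {
            $parts[] = $content;
        }
    }

    return implode("\n", $parts);
}

// Usage: two pages with an ad block that should not end up in the result.
$pages = [
    '<html><body><div class="ad">Ad</div><div id="content"><p>One</p></div></body></html>',
    '<html><body><div id="content"><p>Two</p></div><div class="ad">Ad</div></body></html>',
];
echo mergePages($pages, '//div[@id="content"]'), "\n";
```

The weak point is exactly the one noted above: a hand-written selector has to be right for every site, whereas Diffbot works out the content area on its own.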

Maybe make this a switch of some kind, with an additional setter?