dpovshed / octopus

Sitemap checker/stress test tool based on ReactPHP

Support for XML Sitemap Index #20

Closed by holtkamp 6 years ago

holtkamp commented 6 years ago

Idea: Multiple XML Sitemaps might be referenced in a "Sitemap Index". It would be nice if:

  1. such a Sitemap Index is properly recognized
  2. the Sitemaps in the Sitemap Index are loaded
  3. the URLs in all Sitemaps are added to the queue of URLs that will be crawled

Approach: Either of the following approaches could realize this:

  1. detect a Sitemap Index during population of the TargetManager, iterate over the referenced Sitemaps and add all encountered URLs
  2. when a URL is crawled, detect that the response is a Sitemap and add its URLs

I think the first approach is the best: that way the queue is populated with all the URLs right from the start and not gradually "as the URLs are crawled".
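
To illustrate the first approach, here is a minimal sketch of expanding a Sitemap Index into a flat list of page URLs before crawling starts. It is not octopus code: `collectPageUrls` and the `$load` callable are hypothetical names, and `$load` is assumed to fetch and parse one sitemap, returning `['type' => 'index' | 'urlset', 'locs' => string[]]`.

```php
<?php
declare(strict_types=1);

/**
 * Expand a sitemap (or Sitemap Index) into a flat list of page URLs.
 *
 * $load is a hypothetical callable: given a sitemap URL it returns
 * ['type' => 'index' | 'urlset', 'locs' => string[]].
 */
function collectPageUrls(string $rootSitemapUrl, callable $load): array
{
    $pending = [$rootSitemapUrl]; // sitemaps that still have to be loaded
    $seen = [];                   // guards against sitemaps referenced more than once
    $pageUrls = [];

    while ($pending !== []) {
        $sitemapUrl = array_shift($pending);
        if (isset($seen[$sitemapUrl])) {
            continue;
        }
        $seen[$sitemapUrl] = true;

        $document = $load($sitemapUrl);
        if ($document['type'] === 'index') {
            // An index lists further sitemaps, not pages: queue them for loading.
            foreach ($document['locs'] as $loc) {
                $pending[] = $loc;
            }
        } else {
            foreach ($document['locs'] as $loc) {
                $pageUrls[$loc] = true; // keyed by URL so duplicates collapse
            }
        }
    }

    return array_keys($pageUrls);
}
```

The returned list could then be handed to whatever populates the crawl queue, so the queue is complete from the start.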

Implementation: Currently a regular expression is used to detect the URLs in a Sitemap. For XML Sitemaps, I think some basic XML processing, with either SimpleXMLElement or DOMXPath, could detect a Sitemap Index and process the linked Sitemaps.
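
As a rough sketch of the XML processing meant here (not necessarily what was eventually released), SimpleXMLElement can tell a Sitemap Index from a plain Sitemap by its root element name. The function name `parseSitemap` and the returned array shape are assumptions made for this example. The element names `sitemapindex`, `urlset`, `sitemap`, `url` and `loc` come from the sitemaps.org protocol.

```php
<?php
declare(strict_types=1);

/**
 * Classify a sitemap document and extract its <loc> values.
 *
 * Returns ['type' => 'index' | 'urlset', 'locs' => string[]].
 */
function parseSitemap(string $xml): array
{
    // The constructor throws an Exception when $xml is not well-formed.
    $root = new SimpleXMLElement($xml);

    // <sitemapindex> marks an index; anything else is treated as a <urlset>.
    $isIndex = $root->getName() === 'sitemapindex';
    $childName = $isIndex ? 'sitemap' : 'url';

    $locs = [];
    // Property access matches unprefixed child elements, which includes
    // elements in the document's default namespace.
    foreach ($root->{$childName} as $entry) {
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $locs[] = $loc;
        }
    }

    return ['type' => $isIndex ? 'index' : 'urlset', 'locs' => $locs];
}
```

A caller would fetch each sitemap body, pass it to this function, and either recurse into the `locs` of an index or queue the `locs` of a plain Sitemap for crawling.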

holtkamp commented 6 years ago

Released in https://github.com/dpovshed/octopus/releases/tag/0.2.0