Idea
Multiple XML Sitemaps might be referenced in a "Sitemap Index". It would be nice if:
such a Sitemap Index is properly recognized
the Sitemaps in the Sitemap Index are loaded
the URLs in all Sitemaps are added to the queue of URLs that will be crawled
Approach
The following approaches might realize this:
detect a Sitemap Index during population of the TargetManager, iterate over the referenced Sitemaps and add all encountered URLs
when a URL is crawled, detect that the response is a Sitemap and add the URLs
I think the first approach is the best: that way the queue is populated with all the URLs right from the start and not gradually "as the URLs are crawled".
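The first approach could be sketched roughly as follows. This is only an illustration under assumptions: `collectSitemapUrls`, `$fetch` and `$parse` are hypothetical names that are not part of the existing code, and the array shape returned by `$parse` is invented for the example. How the body is downloaded and how the XML is parsed are left to caller-supplied callables.

```php
<?php
/**
 * Hypothetical sketch: expand a Sitemap (or Sitemap Index) into page URLs
 * before crawling starts.
 *
 * $fetch: callable(string $url): ?string  body of the URL, or null on failure
 * $parse: callable(string $body): array   ['type' => 'index'|'urlset', 'locs' => string[]]
 *
 * @return string[] unique page URLs in discovery order
 */
function collectSitemapUrls(string $sitemapUrl, callable $fetch, callable $parse): array
{
    $pending = [$sitemapUrl]; // Sitemaps that still have to be loaded
    $seen = [];               // Sitemaps already loaded
    $urls = [];               // page URLs, stored as keys to drop duplicates

    while ($pending !== []) {
        $current = array_shift($pending);
        if (isset($seen[$current])) {
            continue; // defensive guard against Sitemaps that reference each other
        }
        $seen[$current] = true;

        $body = $fetch($current);
        if ($body === null) {
            continue; // skip Sitemaps that could not be loaded
        }

        $result = $parse($body);
        foreach ($result['locs'] as $loc) {
            if ($result['type'] === 'index') {
                $pending[] = $loc;  // entry of a Sitemap Index: another Sitemap to load
            } else {
                $urls[$loc] = true; // entry of a regular Sitemap: a URL to crawl
            }
        }
    }

    return array_keys($urls);
}
```

The returned list would then be handed to the TargetManager in one go. How that hand-over looks depends on the TargetManager API, which is not shown here.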
Implementation
Currently a regular expression is used to detect the URLs in a Sitemap. For XML Sitemaps, I think some basic XML processing could be used instead to detect a Sitemap Index and process the linked Sitemaps, either with SimpleXMLElement or DOMXPath.
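A minimal sketch of the detection step with DOMXPath might look like this. The function name `parseSitemapXml` and the returned array shape are assumptions made for the example, not existing code. It relies on the root element names defined by the Sitemap protocol (`sitemapindex` and `urlset`) and matches elements by local name, so the default Sitemap namespace does not have to be registered.

```php
<?php
/**
 * Hypothetical helper: classify a Sitemap document and return its <loc> values.
 *
 * @return array ['type' => 'index'|'urlset'|'unknown', 'locs' => string[]]
 */
function parseSitemapXml(string $xml): array
{
    $unknown = ['type' => 'unknown', 'locs' => []];
    if (trim($xml) === '') {
        return $unknown;
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $unknown;
    }

    // The root element tells a Sitemap Index apart from a regular Sitemap.
    $root = $doc->documentElement->localName;
    if ($root === 'sitemapindex') {
        $type = 'index';
        $entry = 'sitemap';
    } elseif ($root === 'urlset') {
        $type = 'urlset';
        $entry = 'url';
    } else {
        return $unknown;
    }

    // $entry comes from the fixed set above, so it is safe to embed in the query.
    $query = sprintf("/*/*[local-name()='%s']/*[local-name()='loc']", $entry);
    $xpath = new DOMXPath($doc);

    $locs = [];
    foreach ($xpath->query($query) as $node) {
        $loc = trim($node->textContent);
        if ($loc !== '') {
            $locs[] = $loc;
        }
    }

    return ['type' => $type, 'locs' => $locs];
}
```

For a Sitemap Index the returned `locs` are the linked Sitemaps that still need to be loaded; for a regular Sitemap they are the URLs to crawl. Non-XML Sitemaps (for example plain text lists) come back as `unknown` and could keep going through the existing regular expression.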