Value Added
Consolidates results for different subdomains under the same overall domain, and enables crawling of links that are on different subdomains.
Description
Currently the crawl service only crawls pages that are on exactly the same hostname as the provided base URL. Any links on a site that reference a different subdomain (www. etc.) are therefore not crawled.
This can have a significant impact: if all the URLs on a site use the www. subdomain but the base URL does not, only the first page is crawled.
The crawl service should be updated to crawl any page that is on the same domain as the base URL. Crawls against different subdomains should update the known URLs for the overall domain in DynamoDB.
Results for crawls of different subdomains should be stored under the same partition key (the domain name).
Acceptance Criteria
AC01
Update the crawl service to enable the crawling of pages that are on the same domain, regardless of subdomain.
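One possible shape for the same-domain check in AC01 is sketched below. It is only an illustration: the function names `hostname_of`, `is_on_domain` and `filter_links` are hypothetical, and it assumes the overall domain for the crawl (e.g. example.com) has already been determined elsewhere.

```python
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the lower-cased hostname of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower().rstrip(".")


def is_on_domain(url, domain):
    """True if the URL's host is the domain itself or any subdomain of it.

    The leading "." in the suffix check stops notexample.com from
    matching example.com.
    """
    host = hostname_of(url)
    domain = domain.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)


def filter_links(links, domain):
    """Keep only the links that should be crawled for this domain."""
    return [link for link in links if is_on_domain(link, domain)]
```

With this check a crawl started from https://example.com would follow links to https://www.example.com/about, whereas an exact hostname comparison would skip them.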
AC02
The crawl service should only allow one crawl every 2 days on any given domain.
i.e. only one crawl should be performed if multiple are initiated on different subdomains of that domain.
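A minimal sketch of the AC02 rule follows, keyed on the domain rather than the hostname. The in-memory dict `last_crawls` is a stand-in for wherever the service records the last crawl time, and `try_start_crawl` is a hypothetical name. In DynamoDB the check and the update would need to be a single conditional write to stay safe under concurrent requests.

```python
from datetime import datetime, timedelta, timezone

# Minimum gap between crawls of the same domain (AC02).
CRAWL_INTERVAL = timedelta(days=2)


def try_start_crawl(last_crawls, domain, now):
    """Record and allow a crawl unless the domain was crawled too recently.

    last_crawls maps domain name -> datetime of the last accepted crawl.
    Returns True if the crawl may proceed, False if it should be rejected.
    """
    last = last_crawls.get(domain)
    if last is not None and now - last < CRAWL_INTERVAL:
        return False
    last_crawls[domain] = now
    return True


# Usage: requests for www.example.com and example.com both resolve to the
# domain "example.com", so the second request is rejected.
crawls = {}
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
first = try_start_crawl(crawls, "example.com", start)
second = try_start_crawl(crawls, "example.com", start + timedelta(hours=1))
```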
AC03
The crawl service should be updated so that the URLs and cached page content are stored against the domain name rather than the specific hostname provided by the user.
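The sketch below illustrates how the partition key in AC03 could be derived from the user-supplied URL. The attribute names `domain` and `url` and the function names are assumptions, not the service's actual schema. `naive_domain` simply keeps the last two labels of the hostname, which is wrong for suffixes such as co.uk; a real implementation would need the Public Suffix List.

```python
from urllib.parse import urlsplit


def naive_domain(hostname):
    """Assumption: the domain is the last two labels of the hostname.

    This is incorrect for multi-label suffixes (e.g. example.co.uk).
    """
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def build_url_item(user_url, page_url):
    """Build a record for a discovered page, keyed by domain (hypothetical schema)."""
    domain = naive_domain(urlsplit(user_url).hostname or "")
    return {"domain": domain, "url": page_url}


# Crawls started from either hostname produce the same partition key.
item_a = build_url_item("https://www.example.com", "https://www.example.com/about")
item_b = build_url_item("https://example.com", "https://example.com/contact")
```

Because both items share the key example.com, the known URLs from the two crawls end up in the same partition rather than in separate per-hostname records.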