This sample demonstrates how to create a serverless web crawler (or web scraper) using AWS Lambda and AWS Step Functions. It scales to crawl large websites that would time out if a single Lambda function crawled the whole site. The web crawler is written in TypeScript and uses Puppeteer to extract content and URLs from a given webpage.
Additionally, this sample demonstrates an example use case for the crawler by indexing crawled content into Amazon Kendra, providing machine-learning-powered search over the crawled content. The CloudFormation stack for the Kendra resources is optional; you can deploy just the web crawler if you like. Make sure to review Kendra's pricing and free tier before deploying the Kendra part of the sample.
The AWS Cloud Development Kit (CDK) is used to define the infrastructure for this sample as code.
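To give a concrete sense of the Puppeteer extraction described above, here is a minimal sketch of loading a page and pulling out its text content and same-site links. The function name and return shape are illustrative assumptions, not the sample's actual code.

```ts
import puppeteer from 'puppeteer';

// Hypothetical shape of what a single crawl step might extract from a page.
interface PageExtract {
  url: string;
  title: string;
  content: string;
  links: string[];
}

// Minimal sketch: render a page with Puppeteer and extract its text and URLs.
// The real crawler's extraction lives in the sample's source; this is illustrative only.
const extractPage = async (url: string, baseUrl: string): Promise<PageExtract> => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    const title = await page.title();
    // Grab the rendered text of the page body.
    const content = await page.evaluate(() => document.body.innerText);
    // Collect absolute hrefs and keep only those under the base URL.
    const hrefs = await page.evaluate(() =>
      Array.from(document.querySelectorAll('a[href]')).map(
        (a) => (a as HTMLAnchorElement).href,
      ),
    );
    const links = [...new Set(hrefs)].filter((href) => href.startsWith(baseUrl));

    return { url, title, content, links };
  } finally {
    await browser.close();
  }
};
```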
The web crawler is best explained by the AWS Step Functions state machine diagram in the repository.

The batch size is defined in src/constructs/webcrawler/constants.ts. You can also update the environment variable PARALLEL_URLS_TO_SYNC on the Read Queued Urls step's Lambda function to change the batch size after the CDK stack is deployed.
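As a rough illustration of how these two settings could fit together, the sketch below shows a batch-size constant with a runtime environment-variable override. Only the file path and the PARALLEL_URLS_TO_SYNC variable name come from the sample; the constant name, default value, and helper are assumptions.

```ts
// Hypothetical sketch only: the real constant name and wiring live in
// src/constructs/webcrawler/constants.ts and the CDK constructs.
export const DEFAULT_PARALLEL_URLS_TO_SYNC = 25; // assumed default value

// At runtime, the Read Queued Urls Lambda could read the deployed value from its
// environment, falling back to the constant, so the batch size can be tuned via
// the PARALLEL_URLS_TO_SYNC environment variable without redeploying the stack.
export const getBatchSize = (): number =>
  Number(process.env.PARALLEL_URLS_TO_SYNC ?? DEFAULT_PARALLEL_URLS_TO_SYNC);
```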
Make sure you have Node.js and npm installed:
npm -v && node -v
Install the AWS CDK:
npm i -g aws-cdk
This repository provides a utility script to build and deploy the sample.
To deploy the web crawler on its own, run:
./deploy --profile <YOUR_AWS_PROFILE>
Or you can deploy the web crawler with Kendra too:
./deploy --profile <YOUR_AWS_PROFILE> --with-kendra
Note that if deploying with Kendra, ensure your profile is configured with one of the AWS regions that supports Kendra. See the AWS Regional Services List for details.
When the infrastructure has been deployed, you can trigger a run of the crawler with the included utility script:
./crawl --profile <YOUR_AWS_PROFILE> --name lambda-docs --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg
You can play with the arguments above to try different websites.
--base-url specifies the target website to crawl; only URLs starting with the base URL will be queued.
--start-paths specifies one or more paths in the website to start crawling from.
--keywords filters the queued URLs to only those containing one or more of the given keywords (i.e. above, only URLs containing lambda/latest/dg are added to the queue).
--name is optional, and helps identify which Step Functions execution or DynamoDB table corresponds to which crawl.
The crawl script will print a link to the AWS console so you can watch your Step Functions state machine execution in action.
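Each crawl corresponds to a Step Functions state machine execution. The sketch below shows one plausible way a script like crawl could start an execution with these arguments using the AWS SDK; the input field names and the way the state machine ARN is obtained are assumptions, not the sample's actual implementation.

```ts
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

// Hypothetical input shape; the sample's state machine may expect different field names.
interface CrawlInput {
  crawlName?: string;
  baseUrl: string;
  startPaths: string[];
  keywords?: string[];
}

// Start a crawl by kicking off a state machine execution.
// The state machine ARN would come from the deployed stack's outputs.
const startCrawl = async (stateMachineArn: string, input: CrawlInput): Promise<string> => {
  const sfn = new SFNClient({});
  const { executionArn } = await sfn.send(
    new StartExecutionCommand({
      stateMachineArn,
      name: `${input.crawlName ?? 'crawl'}-${Date.now()}`,
      input: JSON.stringify(input),
    }),
  );
  return executionArn!;
};
```

The console link the crawl script prints simply points at the resulting execution so you can follow its progress.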
If you also deployed the Kendra stack (--with-kendra), you can visit the Kendra console to see an example search page for the Kendra index; the crawl script will print a link to this page. Note that it will take a few minutes after the crawler has completed for Kendra to index the newly stored content.
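If you prefer to query the index programmatically rather than through the example search page, a minimal sketch with the AWS SDK looks like this; the index ID would come from the deployed Kendra stack, and nothing here is specific to the sample's code.

```ts
import { KendraClient, QueryCommand } from '@aws-sdk/client-kendra';

// Minimal sketch: run a natural-language query against the Kendra index that
// the crawler populated. Pass the index ID output by the Kendra stack.
const searchCrawledContent = async (indexId: string, queryText: string) => {
  const kendra = new KendraClient({});
  const { ResultItems } = await kendra.send(
    new QueryCommand({ IndexId: indexId, QueryText: queryText }),
  );
  return (ResultItems ?? []).map((item) => ({
    title: item.DocumentTitle?.Text,
    uri: item.DocumentURI,
    excerpt: item.DocumentExcerpt?.Text,
  }));
};
```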
If you're playing with the core crawler logic, it might be handy to test it out locally.
You can run the crawler locally with:
./local-crawl --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg
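Conceptually, the core crawl loop is a breadth-first traversal with the same base-URL and keyword filtering as the CLI flags above. The following self-contained sketch illustrates that idea in-process; it is not the sample's actual local-crawl implementation, which reuses the deployed crawler's code.

```ts
import puppeteer from 'puppeteer';

// Conceptual sketch of the crawl loop: breadth-first over discovered links,
// keeping only URLs under the base URL that contain one of the keywords.
// The sample distributes this loop across Lambda invocations; here it runs locally.
const localCrawl = async (baseUrl: string, startPaths: string[], keywords: string[]) => {
  const browser = await puppeteer.launch();
  const visited = new Set<string>();
  const queue = startPaths.map((p) => new URL(p, baseUrl).href);

  try {
    while (queue.length > 0) {
      const url = queue.shift()!;
      if (visited.has(url)) continue;
      visited.add(url);

      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2' });
      const links = await page.evaluate(() =>
        Array.from(document.querySelectorAll('a[href]')).map(
          (a) => (a as HTMLAnchorElement).href,
        ),
      );
      await page.close();

      for (const link of links) {
        const matchesKeyword = keywords.length === 0 || keywords.some((k) => link.includes(k));
        if (link.startsWith(baseUrl) && matchesKeyword && !visited.has(link)) {
          queue.push(link);
        }
      }
      console.log(`Crawled ${url} (${queue.length} queued)`);
    }
  } finally {
    await browser.close();
  }
};
```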
You can clean up all your resources when you're done via the destroy
script.
If you deployed just the web crawler:
./destroy --profile <YOUR_AWS_PROFILE>
Or if you deployed the web crawler with Kendra too:
./destroy --profile <YOUR_AWS_PROFILE> --with-kendra
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.