This sample accompanies the AWS Architecture Blog post: https://aws.amazon.com/blogs/architecture/scaling-up-a-serverless-web-crawler-and-search-engine/

🕷 Serverless Web Crawler and Search Engine with Step Functions and Kendra

Overview

This sample demonstrates how to create a serverless web crawler (or web scraper) using AWS Lambda and AWS Step Functions. It scales to crawl large websites that would cause a single Lambda function to time out. The web crawler is written in TypeScript and uses Puppeteer to extract content and URLs from a given webpage.
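Since the crawler is given a base URL to stay within, the link-normalisation part of the extraction step can be pictured as a small routine like the one below. This is an illustrative sketch, not the sample's actual code: `extractPaths` is a hypothetical helper, and the real sample gathers `href` values with Puppeteer inside a Lambda function.

```typescript
// Hypothetical sketch: given the page's URL and the raw href values
// scraped from it, resolve them and keep only same-origin paths.
function extractPaths(baseUrl: string, hrefs: string[]): string[] {
  const base = new URL(baseUrl);
  const paths = new Set<string>();
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, base); // resolve relative links against the page URL
    } catch {
      continue; // skip malformed hrefs
    }
    if (resolved.origin !== base.origin) continue; // stay on the same site
    paths.add(resolved.pathname);
  }
  return [...paths];
}
```

Deduplicating into a `Set` keyed by pathname avoids re-queueing the same page when it is linked from many places.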

Additionally, this sample demonstrates an example use case for the crawler by indexing crawled content into Amazon Kendra, providing machine-learning-powered search over the crawled content. The CloudFormation stack for the Kendra resources is optional; you can deploy just the web crawler if you like. Make sure to review Kendra's pricing and free tier before deploying the Kendra part of the sample.

The AWS Cloud Development Kit (CDK) is used to define the infrastructure for this sample as code.

Architecture

architecture-diagram

The Web Crawler

The web crawler is best explained by the AWS Step Functions State Machine diagram:

state-machine-diagram
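The loop the state machine expresses can be sketched in plain TypeScript. The version below is a simplified, in-memory illustration (`crawlSite` and `CrawlPage` are hypothetical names): in the real sample, each iteration is orchestrated by Step Functions rather than a single process, which is what lets large sites be crawled without any one Lambda invocation timing out.

```typescript
// In-memory sketch of the crawl loop: take unvisited URLs from a queue,
// crawl each page, enqueue newly discovered URLs, repeat until empty.
type CrawlPage = (url: string) => { content: string; urls: string[] };

function crawlSite(startUrls: string[], crawlPage: CrawlPage): Map<string, string> {
  const visited = new Map<string, string>(); // url -> extracted content
  const queue = [...startUrls];
  while (queue.length > 0) {
    const url = queue.shift()!;
    if (visited.has(url)) continue; // already crawled via another link
    const { content, urls } = crawlPage(url);
    visited.set(url, content);
    for (const next of urls) {
      if (!visited.has(next)) queue.push(next);
    }
  }
  return visited;
}
```

The key design point the state machine adds over this sketch is durability: the queue and visited set outlive any single function execution, so the crawl can span many short-lived Lambda invocations.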

Prerequisites

To build and deploy the sample you will need Node.js and npm (the project is written in TypeScript and deployed with the AWS CDK), along with an AWS account and a credentials profile configured for the AWS CLI (the scripts accept the profile via the --profile flag).

Deploy

This repository provides a utility script to build and deploy the sample.

To deploy the web crawler on its own, run:

./deploy --profile <YOUR_AWS_PROFILE>

Or you can deploy the web crawler with Kendra too:

./deploy --profile <YOUR_AWS_PROFILE> --with-kendra

Note that if deploying with Kendra, your profile must be configured for an AWS Region that supports Kendra. See the AWS Regional Services List for details.

Run The Crawler

When the infrastructure has been deployed, you can trigger a run of the crawler with the included utility script:

./crawl --profile <YOUR_AWS_PROFILE> --name lambda-docs --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg

You can adjust the arguments above to crawl different websites.

The crawl script will print a link to the AWS console so you can watch your Step Function State Machine execution in action.
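Judging by the example arguments above, the --keywords option appears to constrain the crawl to URLs containing the given substrings. A filter of that shape might look like the following; this is an assumption about the script's behaviour, and `matchesKeywords` is a hypothetical name, not part of the sample:

```typescript
// Hypothetical keyword filter: keep a URL path if it contains at least
// one of the provided keywords; with no keywords, keep everything.
function matchesKeywords(path: string, keywords: string[]): boolean {
  return keywords.length === 0 || keywords.some((k) => path.includes(k));
}
```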

Search Crawled Content

If you also deployed the Kendra stack (--with-kendra), you can visit the Kendra console to see an example search page for the Kendra index. The crawl script will print a link to this page if you deployed Kendra. Note that once the crawler has completed, it will take a few minutes for Kendra to index the newly stored content.

kendra-screenshot

Run The Crawler Locally

If you're playing with the core crawler logic, it might be handy to test it out locally.

You can run the crawler locally with:

./local-crawl --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg

Cleaning Up

You can clean up all your resources when you're done via the destroy script.

If you deployed just the web crawler:

./destroy --profile <YOUR_AWS_PROFILE>

Or if you deployed the web crawler with Kendra too:

./destroy --profile <YOUR_AWS_PROFILE> --with-kendra

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.