Kendra Web Crawler documents not showing in QnABot Designer

maxsiopl commented 1 year ago

Describe the bug: When I click the Kendra Web Crawler "Start Crawling" button, after some time the crawling status shows "COMPLETE" and a number of added documents is reported. But after the crawl finishes, I can't see any questions generated by Kendra in the Edit Questions tab.

My configuration is:

ENABLE_KENDRA_WEB_INDEXER: true
KENDRA_INDEXER_URLS: https://docs.aws.amazon.com/ (example)
KENDRA_INDEXER_SCHEDULE: rate(1 day)
KENDRA_INDEXER_CRAWL_DEPTH: 4
KENDRA_INDEXER_CRAWL_MODE: SUBDOMAINS
KENDRA_WEB_PAGE_INDEX: f0459633---****-44fb11faaa6f

Here is the result of my most recent crawl. The earlier ones were experiments, so please ignore them.

Expected behavior: Kendra documents are visible in the Edit Questions tab, and crawled documents can be tested in the Test panel.

Please complete the following information about the solution:

[X] QnABot Version: 5.2.5
[X] Region: us-east-1
[X] Have you checked your service quotas for the services this solution uses?
[] Were there any errors in the CloudWatch Logs? No

Additional context: I searched through all of the Kendra and QnABot documentation and didn't find any information about how to use crawled documents. I don't know how this should work. Can anyone help me? Thank you.

ihmaws commented 1 year ago

Hi @maxsiopl, thank you for the question and for providing all the details.

The Kendra Web Crawler feature is meant as a convenience feature to leverage Kendra's WebCrawler data source from within the Content Designer.

It doesn't import any questions back into the questions tab (and isn't meant to). When the "Start Crawling" button is clicked, the Kendra APIs are used to create a new data source in the provided index. Once the data source has been created, and if you are using the Kendra fallback feature, queries that don't find a match in OpenSearch are sent to Kendra, which searches the index for a response. You may need to set ALT_SEARCH_KENDRA_INDEXES if you are not already using it (details).
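To confirm that the crawled documents are searchable, the index can also be queried directly, outside QnABot. The following is only a minimal sketch, assuming Python with boto3's Kendra `query` call; the helper name `top_kendra_results` is hypothetical and not part of QnABot, and the client is passed in as a parameter so the function can be tried with a stand-in object.

```python
def top_kendra_results(kendra_client, index_id, question, limit=3):
    """Query a Kendra index directly and return (title, uri, excerpt) tuples.

    kendra_client is expected to behave like boto3.client("kendra").
    This helper is illustrative only; it is not how QnABot itself queries Kendra.
    """
    response = kendra_client.query(IndexId=index_id, QueryText=question)
    results = []
    for item in response.get("ResultItems", [])[:limit]:
        # Title and excerpt are nested under a "Text" key in the Query response.
        title = item.get("DocumentTitle", {}).get("Text", "")
        excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
        results.append((title, item.get("DocumentURI", ""), excerpt))
    return results


# Example use against a real index (requires AWS credentials):
#   import boto3
#   client = boto3.client("kendra", region_name="us-east-1")
#   print(top_kendra_results(client, "<your KENDRA_WEB_PAGE_INDEX>", "what is vid con"))
```

If this returns results for the index used by the crawler but QnABot still gives no answer, the fallback settings are the more likely place to look.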

As for your second question, "Possibility to test crawled documents in Test panel": this is currently not supported. However, it is something we have discussed internally and may plan for a future release. For now, you must open a session through the client UI to test Kendra Fallback responses.

Please let me know if that helps answer your question or if you've got any follow-ups.

maxsiopl commented 1 year ago

It helped me a lot, thank you very much for your response!

maxsiopl commented 1 year ago

Hello @ihmaws, coming back to this thread after some time. The issue still persists. After a successful web crawl (in this example, of a Discord FAQ website):

In Kendra, I can see the documents added by the web crawler, which is good. But if I ask any question from those documents in the Alexa Skill, for example:

It returns an error response:

Logs from fulfillment Lambda:

2023-04-06T12:17:41.759Z 0255ff2e-13e4-4ec4-9937-b535cddb30d5 INFO kendra query request: { "kendra_faq_index": "e2a9cc66-13c4-43f6-a003-eb025b6b446b", "maxRetries": 8, "retryDelay": 600, "minimum_score": "HIGH", "size": 1, "question": "what is vid con", "es_address": "search-qnabot-elasti-1jc9qzq3hhe6t-gz2ovqjifwlzcyotxrzspjujgy.us-east-1.es.amazonaws.com", "es_path": "/qnabot-website-crawler/_doc/_search?search_type=dfs_query_then_fetch", "same_index": true }

2023-04-06T12:17:41.864Z 0255ff2e-13e4-4ec4-9937-b535cddb30d5 INFO
{ "message": "Index Id e2a9cc66-13c4-43f6-a003-eb025b6b446b not found for Customer Id 003994647303.", "code": "ResourceNotFoundException", "time": "2023-04-06T12:17:41.863Z", "requestId": "639b3450-2f06-4125-94c2-a2b6f663c48e", "statusCode": 400, "retryable": false, "retryDelay": 246.575879864807 }

2023-04-06T12:17:41.865Z 0255ff2e-13e4-4ec4-9937-b535cddb30d5 ERROR Invoke Error
{ "errorType": "Error", "errorMessage": "Error from Kendra query request:ResourceNotFoundException: Index Id e2a9cc66-13c4-43f6-a003-eb025b6b446b not found for Customer Id 003994647303.", "stack": [ "Error: Error from Kendra query request:ResourceNotFoundException: Index Id e2a9cc66-13c4-43f6-a003-eb025b6b446b not found for Customer Id 003994647303.", " at intoError (file:///var/runtime/index.mjs:46:16)", " at postError (file:///var/runtime/index.mjs:707:51)", " at callback (file:///var/runtime/index.mjs:723:11)", " at file:///var/runtime/index.mjs:778:20", " at router.start (/var/task/lib/router/index.js:24:17)", " at processTicksAndRejections (node:internal/process/task_queues:96:5)" ] }

I'm curious about that Customer ID error. Can you confirm whether it's a Kendra-related or a QnABot-related error? Thank you in advance for any help.

ihmaws commented 1 year ago

Hey @maxsiopl, I see the following:

"message": "Index Id e2a9cc66-13c4-43f6-a003-eb025b6b446b not found for Customer Id 003994647303."

Does the Kendra index you have set for kendra_faq_index exist in your account?
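One way to check this is to list the indexes visible to your credentials and look for the ID from the log. Below is a minimal sketch, assuming Python with boto3's Kendra `list_indices` call; the helper name `index_exists` is hypothetical. Keep in mind that the listing only covers the account and Region the client was created for.

```python
def index_exists(kendra_client, index_id):
    """Return True if index_id is among the Kendra indexes visible to the client.

    kendra_client is expected to behave like boto3.client("kendra"); the
    account and Region are whatever that client was created with.
    """
    kwargs = {}
    while True:
        page = kendra_client.list_indices(**kwargs)
        for summary in page.get("IndexConfigurationSummaryItems", []):
            if summary.get("Id") == index_id:
                return True
        # Follow pagination until there is no further token.
        token = page.get("NextToken")
        if not token:
            return False
        kwargs = {"NextToken": token}


# Example use (requires AWS credentials):
#   import boto3
#   client = boto3.client("kendra", region_name="us-east-1")
#   print(index_exists(client, "<value of KENDRA_FAQ_INDEX>"))
```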

I believe the FAQ_INDEX setting points to an index where QnABot can upload your question bank (to provide semantic search capabilities instead of keyword-based search). It is not intended for use as a fallback. So if the index does exist, and you have never synced your question bank into it (see screenshot), then the FAQ will be empty, which may be responsible for this error.

If you want to use the Kendra fallback, you need to use ALT_SEARCH_KENDRA_INDEXES.

But these are all just guesses. Can you let me know what your Kendra settings are (redact private info of course!)? Also, please give me a short description of what you are trying to do and I may be able to provide some general tips (for example I now see Alexa Skill involved, very cool!)

maxsiopl commented 1 year ago

Okay, so when KENDRA_FAQ_INDEX is set and I try to SYNC KENDRA FAQ, after some time it just returns an error:

But in the CloudWatch Lambda logs we get:

2023-04-10T15:02:51.322Z 993ab2e0-5d5d-4d77-a0c6-8a420c9d64b6 INFO describeFaq { "Id": "4f4a7474-3d1a-412e-9ece-6eecbad0a6df", "IndexId": "bfd4157b-83e6-4d5b-87c3-cf2757beeee7", "Name": "qna-facts", "Description": "Exported FAQ of questions from QnABot designer console", "CreatedAt": "2023-04-10T15:01:15.067Z", "UpdatedAt": "2023-04-10T15:02:46.769Z", "S3Path": { "Bucket": "qnabot-website-crawler-exportbucket-3l52jl0qjoh6", "Key": "kendra_json/qna_FAQ.json" }, "Status": "FAILED", "RoleArn": "arn:aws:iam::003994647303:role/QnABot-Website-Crawler-ExportStack-1F-KendraS3Role-1BGJRE7D5SM68", "ErrorMessage": "The FaqDocuments list in the specified JSON file is null or empty. You must specify a list. Make sure that you have a list of FAQ documents and try your request again.", "FileFormat": "JSON" }

Here is my QnABot config (the Export config function only returns JSON with the non-default settings, but I wanted to include all settings):

Basically, the goal of this QnABot is to scan an FAQ website (in this example, the Discord FAQ) and pass those FAQs on to an Alexa Skill. Is this possible with the QnABot Web Crawler? Can you explain what I am doing wrong? :) Also, in this scenario, how should I use Kendra Fallback?

Again thanks for help!

ihmaws commented 1 year ago

Thanks for the settings, that helps. It seems my last comment was more confusing than helpful. Let me try to be clearer (I just set up a test using v5.3.1 and asked "What is Vidcon" through Alexa, and it provided the right response).

1. Clear the KENDRA_FAQ_INDEX setting. From what I can tell, you are not using that feature, so it is causing errors when QnABot searches for responses (it looks like you are using a Kendra data source, not a Kendra FAQ).
2. You are using 2 different Kendra indexes. KENDRA_WEB_PAGE_INDEX is the index where web-crawled documents are stored. ALT_SEARCH_KENDRA_INDEXES is where QnABot searches for responses. So you need to include the KENDRA_WEB_PAGE_INDEX value in ALT_SEARCH_KENDRA_INDEXES. See here
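
The two points above can be turned into a quick sanity check over an exported settings file. This is only a sketch under stated assumptions: the helper names `parse_index_list` and `check_kendra_settings` are hypothetical and not part of QnABot, and the format of ALT_SEARCH_KENDRA_INDEXES (a JSON array string or a comma-separated string) is an assumption that may differ between QnABot versions.

```python
import json


def parse_index_list(raw):
    """Parse an index list given as a JSON array string or a comma-separated string.

    Both formats are assumptions; check the settings documentation for your version.
    """
    raw = (raw or "").strip()
    if not raw:
        return []
    if raw.startswith("["):
        return [str(item).strip() for item in json.loads(raw)]
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_kendra_settings(settings):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if (settings.get("KENDRA_FAQ_INDEX") or "").strip():
        problems.append("KENDRA_FAQ_INDEX is set; clear it unless you sync a Kendra FAQ")
    web_index = (settings.get("KENDRA_WEB_PAGE_INDEX") or "").strip()
    fallback = parse_index_list(settings.get("ALT_SEARCH_KENDRA_INDEXES"))
    if web_index and web_index not in fallback:
        problems.append("KENDRA_WEB_PAGE_INDEX is missing from ALT_SEARCH_KENDRA_INDEXES")
    return problems


# Placeholder IDs, for illustration only.
example = {
    "KENDRA_FAQ_INDEX": "",
    "KENDRA_WEB_PAGE_INDEX": "web-index-id",
    "ALT_SEARCH_KENDRA_INDEXES": '["web-index-id"]',
}
print(check_kendra_settings(example))
```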

Give that a try and let's see how it goes.

maxsiopl commented 1 year ago

YOU ARE GOD! After performing the 2 tasks you proposed, everything now works just fine! Thanks a lot! I guess it would be worth mentioning in the Kendra Web Crawler readme where to fill in the specific Kendra indexes, just to be clear :) Thank you again!

ihmaws commented 1 year ago

Thanks for the feedback, I'll have a look at that section of the documentation to see how we can make it more clear.

Glad it all worked out, best of luck! And don't forget to upgrade and test on the latest version while running your initial prototype!

aws-solutions / qnabot-on-aws

Kendra Web Crawler documents not showing in QnABot Designer #565