aws-samples / bedrock-claude-chat

AWS-native chatbot using Bedrock + Claude (+Mistral)
MIT No Attribution

[Bug] Timeout Occurs in Embedding When Adding a Large Number of URLs to Knowledge #341

Closed edamame8888 closed 3 weeks ago

edamame8888 commented 4 weeks ago

Describe the bug

When creating a bot, inputting about 100 URLs (not YouTube URLs) results in the following error in BedrockChatStack-EmbeddingTask:

ERROR:embedding.loaders.playwright:Error fetching or processing https://***, exception: Timeout 30000ms exceeded.

version: v1

To Reproduce

Screenshots

BedrockChatStack-EmbeddingTaskLog Screenshot 2024-06-05 9 05 18

Additional context

Workaround Tried

On my end, I confirmed that calling page.close() inside the Playwright loop improves the situation.

https://github.com/aws-samples/bedrock-claude-chat/blob/v1/backend/embedding/loaders/playwright.py#L150-L170

            for url in self.urls:
                try:
                    page = browser.new_page()
                    response = page.goto(url)
                    if response is None:
                        raise ValueError(
                            f"page.goto() returned None for url {url}")

                    text = self.evaluator.evaluate(page, browser, response)
                    metadata = {"source": url}
                    docs.append(Document(page_content=text, metadata=metadata))
+                    page.close()
+                    logger.info(f"Loaded {url} and page closed.")
                except Exception as e:
                    if self.continue_on_failure:
                        logger.error(
                            f"Error fetching or processing {url}, exception: {e}"
                        )
                    else:
                        raise e
            browser.close()
        return docs
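
Note that in the diff above, page.close() runs only when a page loads successfully; if page.goto() times out, the except branch is taken and that page stays open. A minimal sketch of a variant that closes the page on every path with try/finally might look like the following. The function name load_documents, the evaluate callable, and the dict-based documents are hypothetical stand-ins for the loader's self.urls, self.evaluator.evaluate, and Document; the only calls assumed on the browser and page objects are new_page(), goto(), and close(), as in the original snippet.

```python
import logging

logger = logging.getLogger(__name__)


def load_documents(browser, urls, evaluate, continue_on_failure=True):
    """Fetch each URL in its own page and always close that page.

    `browser` is expected to behave like a Playwright sync Browser.
    `evaluate(page, browser, response)` is a hypothetical stand-in for
    self.evaluator.evaluate in the original loader.
    """
    docs = []
    for url in urls:
        page = None
        try:
            page = browser.new_page()
            response = page.goto(url)
            if response is None:
                raise ValueError(f"page.goto() returned None for url {url}")
            text = evaluate(page, browser, response)
            # Plain dicts stand in for the Document objects used in the issue.
            docs.append({"page_content": text, "metadata": {"source": url}})
        except Exception as e:
            if continue_on_failure:
                logger.error(f"Error fetching or processing {url}, exception: {e}")
            else:
                raise
        finally:
            # Runs on success and on failure, so a timed-out page is released too.
            if page is not None:
                page.close()
    return docs
```

In this sketch the caller still owns the browser and remains responsible for calling browser.close() after the loop, as the original code does.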

Result

success-embedding-log

statefb commented 4 weeks ago

Thank you for suggesting a workaround! Could you explain or guess why your solution resolves the timeout issue?

edamame8888 commented 3 weeks ago

OK! I'll respond after work, so please wait a bit.

edamame8888 commented 3 weeks ago

Performance Before & After

Could you explain or guess why your solution resolves the timeout issue?

The cause is memory usage being maxed out at 100%.

Before the improvement, memory usage shown in CloudWatch Insight was maxed out and content retrieval never finished. After the improvement, memory usage dropped and we confirmed that the embedding completed in a short period of time.

Screenshot 2024-06-06 7 40 26
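
A toy model may help show why leaving pages open could exhaust memory as the URL count grows. It is not based on Playwright internals or on measurements from this issue: FakeBrowser and FakePage are hypothetical classes that only count how many pages are open at once, on the assumption that every open page holds memory until it is closed.

```python
class FakePage:
    """Hypothetical page that reports back to its browser when closed."""

    def __init__(self, browser):
        self._browser = browser

    def close(self):
        self._browser.open_pages -= 1


class FakeBrowser:
    """Hypothetical browser that tracks current and peak open pages."""

    def __init__(self):
        self.open_pages = 0
        self.peak_open_pages = 0

    def new_page(self):
        self.open_pages += 1
        self.peak_open_pages = max(self.peak_open_pages, self.open_pages)
        return FakePage(self)


def peak_open_pages(url_count, close_each_page):
    """Return the largest number of pages open at the same time."""
    browser = FakeBrowser()
    for _ in range(url_count):
        page = browser.new_page()
        if close_each_page:
            page.close()
    return browser.peak_open_pages


print(peak_open_pages(100, close_each_page=False))  # 100
print(peak_open_pages(100, close_each_page=True))   # 1
```

With about 100 URLs, the loop without close() ends up holding 100 pages at once in this model, while the loop with close() never holds more than one.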