TomZhaoJobadder commented 2 weeks ago

Describe the bug

I followed the Bedrock example at https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/bedrock/smart_scraper_bedrock.py and it worked at first. After I changed the URL from source="https://perinim.github.io/projects/" to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate", I got the following error:

Traceback (most recent call last):
  File "c:\GitHub\job-scraper-poc\Test_Code\ai-scraper_bedrock_example.py", line 46, in <module>
    result = smart_scraper_graph.run()
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 224, in execute
    return self._execute_standard(initial_state)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 153, in _execute_standard
    raise e
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 140, in _execute_standard
    result = current_node.execute(state)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\nodes\rag_node.py", line 118, in execute
    index = FAISS.from_documents(chunked_docs, embeddings)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_core\vectorstores.py", line 550, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_community\vectorstores\faiss.py", line 930, in from_texts
    embeddings = embedding.embed_documents(texts)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_aws\embeddings\bedrock.py", line 169, in embed_documents
    response = self._embedding_func(text)
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_aws\embeddings\bedrock.py", line 150, in _embedding_func
    raise ValueError(f"Error raised by inference endpoint: {e}")
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try again.

It looks like the current version of the Bedrock integration can't handle a website whose HTML content is long?

f-aguzzi commented 2 weeks ago

Did you change the embedder from the default cohere.embed-multilingual-v3 used in the example? Cohere has a 512-token context window, and ScrapeGraph should chunk the request accordingly by default. None of the embedders currently supported by ScrapeGraph for Bedrock has a context window of 2048 tokens, so I can't figure out what's being used, and neither can ScrapeGraph. If you didn't change it, then there's either something wrong with ScrapeGraph, or a new breaking change in the Bedrock API.
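One way to narrow this down is to measure the texts just before they reach the embedder. The sketch below is only illustrative: `oversized_texts` is a hypothetical helper, not part of ScrapeGraph or LangChain, and it works on plain strings (for LangChain documents, that would be each document's `page_content`). The 2048-character default is taken from the error message above.

```python
def oversized_texts(texts, max_chars=2048):
    """Return (index, length) for every text longer than max_chars characters."""
    return [(i, len(t)) for i, t in enumerate(texts) if len(t) > max_chars]


if __name__ == "__main__":
    # Hypothetical input: one over-long chunk followed by a short one
    sample = ["x" * 19882, "a short chunk"]
    for index, length in oversized_texts(sample):
        print(f"texts/{index}: expected maxLength 2048, actual {length}")
```

If this reports any entries for the chunks produced on the failing page, the chunking step rather than the Bedrock API is the likely culprit.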

TomZhaoJobadder commented 1 week ago

Hi @f-aguzzi, thanks for getting back to me.

Here is my code:

`
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
from dotenv import load_dotenv
from langchain_aws import BedrockEmbeddings
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
import boto3

load_dotenv()

# ********
# Define the configuration for the graph
# ********

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
        "temperature": 0.0,
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-english-v3"
    }
}

# ********
# Create the SmartScraperGraph instance and run it
# ********

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the job names from the page.",
    # also accepts a string with the already downloaded HTML code
    #source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ********
# Get graph execution info
# ********

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
`

I used bedrock/cohere.embed-english-v3, which should be similar to cohere.embed-multilingual-v3. Please note that with source="https://perinim.github.io/projects/" the code works fine, but when I change it to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate" I get the error I mentioned in the ticket.

Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html. Each text is limited to 512 tokens, and one token is about 4 characters, so the maximum is 2048 characters per text. A long text string in the AI scraper should be cut into several smaller chunks, each under 512 tokens (2048 characters), before they are sent to Bedrock. See the code example here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html#api-inference-examples-cohere-embed
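The chunking described above could look roughly like the following. This is a minimal sketch, not ScrapeGraph's actual implementation: `split_for_embedding` is a hypothetical name, and it counts characters rather than tokens, using the 2048-character limit from the error message.

```python
def split_for_embedding(text: str, max_chars: int = 2048) -> list[str]:
    """Split text into pieces of at most max_chars characters.

    Breaks at the last space inside each window when one exists,
    otherwise falls back to a hard cut at max_chars.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer a word boundary so words are not split in half
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


if __name__ == "__main__":
    page_text = "job listing " * 1700  # stand-in for a long scraped page
    pieces = split_for_embedding(page_text)
    print(len(pieces), max(len(p) for p in pieces))
```

With this limit, the 19882-character text from the error would be split into at least 10 pieces, each of which could then be sent to Bedrock separately.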

f-aguzzi commented 1 week ago

I'll add this embedding model to the tokens dictionary, and it will be included in the next release. In the meantime, I'll post a temporary solution to this problem here in the comments in a few hours.

f-aguzzi commented 1 week ago

Wait, it's already in the tokens dictionary. This will need some proper debugging. To narrow down the problem, do you know whether the other Bedrock embedders work properly or not?

TomZhaoJobadder commented 1 week ago

Thanks, hopefully you can fix it soon. Both bedrock/cohere.embed-english-v3 and cohere.embed-multilingual-v3 have the same issue. I didn't try other embedding models.

VinciGit00 commented 4 days ago

Hi, I figured out the error. I will fix it in the next few days.

mjid13 commented 2 days ago

I have the same error. Which embedding model should I use?

`

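from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

# Embedder passed to the graph as a ready-made instance
embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
    api_key="#######",
    model_name="sentence-transformers/all-MiniLM-l6-v2"
)

graph_config = {
    "llm": {
        "api_key": "#######",
        "model": "claude-3-haiku-20240307",
        "max_tokens": 4000
    },
    "embeddings": {
        "model_instance": embedder_model_instance
    }
}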

smart_scraper_graph = SmartScraperGraph( prompt="data needed from each page is: (Title, subtitle [if any], content [article or text])",
   source="https://www.fm.gov.om/policy-ar/foreign-policy-ar/?lang=ar",
   config=graph_config,
   # schema=schema
)
result = smart_scraper_graph.run()
print(result)

`
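Since the error in this thread comes from a text exceeding the embedder's per-text limit, one possible stopgap, whichever model is used, is to cap text length in front of the embedder. The sketch below is hypothetical and untested against ScrapeGraph: `LengthCappedEmbedder` is not part of ScrapeGraph or LangChain, it assumes the wrapped object exposes `embed_documents`, and whether it is accepted under `model_instance` is an assumption (a real version would probably need to subclass LangChain's `Embeddings` base class). Averaging the piece vectors is a simplification chosen here so that one vector is returned per input text; it is not how the library handles chunks.

```python
class LengthCappedEmbedder:
    """Hypothetical wrapper: split over-long texts, embed the pieces, average the vectors."""

    def __init__(self, inner, max_chars=2048):
        self.inner = inner          # any object with embed_documents(list[str])
        self.max_chars = max_chars  # per-text limit, 2048 in the Bedrock error

    def _pieces(self, text):
        # Fixed-width slices; an empty text still yields one (empty) piece
        n = self.max_chars
        return [text[i:i + n] for i in range(0, len(text), n)] or [""]

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            piece_vectors = self.inner.embed_documents(self._pieces(text))
            dim = len(piece_vectors[0])
            # Element-wise mean over the piece vectors
            mean = [
                sum(v[k] for v in piece_vectors) / len(piece_vectors)
                for k in range(dim)
            ]
            vectors.append(mean)
        return vectors

    def embed_query(self, text):
        return self.embed_documents([text])[0]
```

Under these assumptions, usage would be along the lines of `LengthCappedEmbedder(embedder_model_instance)` in place of the bare instance in `graph_config`.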

ScrapeGraphAI / Scrapegraph-ai

BedRock Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try agai #400
