Closed: TomZhaoJobadder closed this issue 4 months ago.
Did you change the embedder from the default cohere.embed-multilingual-v3
used in the example? Cohere has a 512-token context window, and ScrapeGraph should chunk the request accordingly by default. None of the embedders currently supported by ScrapeGraph for Bedrock has a 2048-token context window, so I can't figure out which model is being used, and neither can ScrapeGraph. If you didn't change it, then either something is wrong with ScrapeGraph or there's a new breaking change in the Bedrock API.
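For context, a quick way to estimate whether a text fits Cohere's 512-token window is the rough 4-characters-per-token rule of thumb. This is only an approximation; the real count comes from Cohere's tokenizer:

```python
# Rough token estimate using the ~4 characters per token rule of thumb.
# Only an approximation; the real count comes from Cohere's tokenizer.
COHERE_CONTEXT_WINDOW = 512  # tokens

def estimated_tokens(text: str) -> int:
    return -(-len(text) // 4)  # ceiling division by 4

def fits_context(text: str) -> bool:
    return estimated_tokens(text) <= COHERE_CONTEXT_WINDOW

print(fits_context("a" * 2048))   # True: exactly at the estimated limit
print(fits_context("a" * 19882))  # False: the payload size from the traceback
```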
hi @f-aguzzi Thanks for getting back to me.
Here is my code:

```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
import boto3
from dotenv import load_dotenv
from langchain_aws import BedrockEmbeddings
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
        "temperature": 0.0,
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-english-v3"
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the job names from the page.",
    # source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
I used bedrock/cohere.embed-english-v3; it should behave the same as cohere.embed-multilingual-v3.
Please note that in my code, if I use
source="https://perinim.github.io/projects/",
it works fine. But if I change it to
source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate",
I get the error I mentioned in the ticket.
Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html. Each text is limited to 512 tokens, and a token is roughly 4 characters, so the maximum is about 2048 characters. If the AI scraper produces a long text string, it should be cut into several smaller chunks, each under 512 tokens (2048 characters), before being sent to Bedrock. Please refer to the code example here: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html#api-inference-examples-cohere-embed
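A minimal sketch of that chunking step. The word-aligned splitting strategy and the 4-characters-per-token figure are illustrative assumptions based on the docs above, not ScrapeGraph's actual implementation:

```python
# Split a long text into pieces of at most 2048 characters (~512 tokens at
# ~4 characters per token) before sending each piece to the embedding
# endpoint. A single word longer than max_chars is kept whole.
MAX_CHARS = 2048

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split `text` into word-aligned chunks of at most `max_chars` each."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for word in text.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

page_text = "jobtitle " * 3000  # roughly the size that triggered the error
chunks = chunk_text(page_text)
assert all(len(chunk) <= MAX_CHARS for chunk in chunks)
```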
I'll add this embedding model to the tokens dictionary, and it will be included in the next release. In the meantime, I'll post a temporary solution to this problem here in the comments in a few hours.
Wait, it's already in the tokens dictionary. This will need some proper debugging. To narrow down the problem, do you know whether the other Bedrock embedders work properly or not?
Thanks, hopefully you can fix it soon. Both bedrock/cohere.embed-english-v3 and cohere.embed-multilingual-v3 have the same issue. I didn't try other embedding models.
Hi, I figured out the error. I will fix it in the next few days.
I have the same error. Which embedding model should I use?
```python
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from scrapegraphai.graphs import SmartScraperGraph

embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
    api_key="#######",
    model_name="sentence-transformers/all-MiniLM-l6-v2"
)

graph_config = {
    "llm": {
        "api_key": "#######",
        "model": "claude-3-haiku-20240307",
        "max_tokens": 4000
    },
    "embeddings": {
        "model_instance": embedder_model_instance
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="data needed from each page is: (Title, subtitle [if any], content [article or text])",
    source="https://www.fm.gov.om/policy-ar/foreign-policy-ar/?lang=ar",
    config=graph_config,
    # schema=schema
)

result = smart_scraper_graph.run()
print(result)
```
Hi, please update to the new version.
Describe the bug
I followed the Bedrock example at https://github.com/VinciGit00/Scrapegraph-ai/blob/main/examples/bedrock/smart_scraper_bedrock.py. It was working at first. Then, after I replaced the URL from source="https://perinim.github.io/projects/", to source="https://www.seek.com.au/jobs?page=1&sortmode=ListedDate", I got the following errors:
```
Traceback (most recent call last):
  File "c:\GitHub\job-scraper-poc\Test_Code\ai-scraper_bedrock_example.py", line 46, in
    result = smart_scraper_graph.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\smart_scraper_graph.py", line 120, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 224, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 153, in _execute_standard
    raise e
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\graphs\base_graph.py", line 140, in _execute_standard
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\scrapegraphai\nodes\rag_node.py", line 118, in execute
    index = FAISS.from_documents(chunked_docs, embeddings)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_core\vectorstores.py", line 550, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_community\vectorstores\faiss.py", line 930, in from_texts
    embeddings = embedding.embed_documents(texts)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_aws\embeddings\bedrock.py", line 169, in embed_documents
    response = self._embedding_func(text)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\GitHub\job-scraper-poc.venv\Lib\site-packages\langchain_aws\embeddings\bedrock.py", line 150, in _embedding_func
    raise ValueError(f"Error raised by inference endpoint: {e}")
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: #/texts/0: expected maxLength: 2048, actual: 19882, please reformat your input and try again.
```
It looks like the current version of the Bedrock integration can't handle a website with long HTML content?
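Until a fix ships, one possible workaround is to pass a wrapped embedder through the `model_instance` key of `graph_config["embeddings"]` (as in the HuggingFace example earlier in this thread) that truncates each text to the 2048-character limit before the Bedrock call. `TruncatingEmbeddings` and `FakeBedrockEmbedder` below are illustrative sketches, not real ScrapeGraph or LangChain classes, and the fake embedder stands in for `BedrockEmbeddings` so the sketch runs offline:

```python
# Workaround sketch: wrap whatever embedder you use so each text is cut to
# Bedrock's 2048-character limit before the API call. Note that truncation
# discards everything past the limit; real chunking would preserve it.

class TruncatingEmbeddings:
    """Delegates to an inner embedder, truncating each text first."""

    def __init__(self, inner, max_chars: int = 2048):
        self.inner = inner
        self.max_chars = max_chars

    def embed_documents(self, texts):
        return self.inner.embed_documents([t[: self.max_chars] for t in texts])

    def embed_query(self, text):
        return self.inner.embed_query(text[: self.max_chars])


class FakeBedrockEmbedder:
    """Stand-in for BedrockEmbeddings: rejects texts over 2048 characters."""

    def embed_documents(self, texts):
        for t in texts:
            if len(t) > 2048:
                raise ValueError(f"expected maxLength: 2048, actual: {len(t)}")
        return [[float(len(t))] for t in texts]

    def embed_query(self, text):
        return [float(len(text))]


safe = TruncatingEmbeddings(FakeBedrockEmbedder())
vectors = safe.embed_documents(["x" * 19882])  # would raise without the wrapper
print(vectors)  # [[2048.0]]
```

In the real pipeline, the inner embedder would be the `BedrockEmbeddings` instance, and `TruncatingEmbeddings(...)` would go into `"embeddings": {"model_instance": ...}`.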