Closed AmosDinh closed 1 week ago
I tested it and it is a lot faster than standard chromium.py. You can check for yourself here: https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/pre/beta...AmosDinh:Scrapegraph-ai:pre/beta
you need to add a .env file like this (if self-hosting):
FIRECRAWL_API_URL=http://localhost:3002
FIRECRAWL_API_KEY=YOUR_API_KEY_NOT_NEEDED
or, if you want to quickly test it out with the free tier of Firecrawl:
FIRECRAWL_API_KEY=YOUR_API_KEY
you need to add this in the scrapegraphai configs: "loader_kwargs": { "scraping_backend": "firecrawl" },
A future option would be to add firecrawl_api_key and firecrawl_api_url arguments to loader_kwargs. Unfortunately, the LangChain class does not support passing the URL as an argument yet; I have submitted a pull request for that.
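As a rough sketch of how the loader could pick up those environment settings (the helper name and behavior are my assumption, not what my branch actually implements):

```python
import os

def firecrawl_settings():
    """Collect Firecrawl settings from the environment (hypothetical helper).

    FIRECRAWL_API_KEY is always read (the hosted free tier requires a real
    key; a self-hosted instance accepts any placeholder). FIRECRAWL_API_URL
    is optional and only needed when self-hosting.
    """
    settings = {"api_key": os.getenv("FIRECRAWL_API_KEY")}
    api_url = os.getenv("FIRECRAWL_API_URL")  # e.g. http://localhost:3002 when self-hosting
    if api_url:
        settings["api_url"] = api_url
    return settings
```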
e.g.:
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "max_results": 5,
    "verbose": True,
    "headless": True,
    "loader_kwargs": {
        "scraping_backend": "firecrawl",
    },
}

search_graph = SearchGraph(
    prompt="Extract information regarding Macbook Pro m1 versions (year, price, specs, etc.). Be specific with the versions and make sure to include all. List every model configuration separately",
    config=graph_config,
)

result = search_graph.run()
print(result)
I'm not sure if you want to add firecrawl-py to requirements.txt, and the docs could be updated too. Are you interested in this?
yes, we could be interested
Ok great, although I won't be able to integrate everything. I don't have a full overview of the features of either Firecrawl or ScrapeGraphAI. So besides the changes I made in my branch, you could, for example, integrate the PDF extraction.
Generally, this should be beneficial to anyone, since crawling with their method is a lot faster. I think they use Playwright internally as well, but I'm not sure why it is faster.
If you integrated even a small part of it and could make a pull request, it would be greatly appreciated. We're getting more and more feature requests along with less and less contributions, so anyone willing to do their homework is automatically our hero :)
Also, we use Rye as our Python project configuration tool. If you want to add Firecrawl as a dependency, either add it to the pyproject.toml file at the root of the project, or add it through the Rye CLI. requirements.txt is built from pyproject.toml using a script. Sorry that this isn't specified anywhere - we need to update the contributing guidelines.
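For reference, the Rye CLI route would be `rye add firecrawl-py` followed by `rye sync`, which records an entry in pyproject.toml roughly like this (the surrounding table is a sketch of the repo's actual file, and I've left the version unpinned rather than guess one):

```toml
[project]
dependencies = [
    # ...existing dependencies...
    "firecrawl-py",
]
```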
Why not just use Jina? It's free and easier to use.
here's an example:
import asyncio
import urllib.parse

from scrapegraphai.graphs import SmartScraperGraph

# graph_config and the Contractor schema are defined elsewhere

async def process_query(query):
    url_encoded_query = urllib.parse.quote(query)
    print(url_encoded_query)
    smart_scraper_graph = SmartScraperGraph(
        prompt="Find the yelp link, name, website, number of average yelp reviews, summary of yelp_reviews, specialties, phone, and their website",
        source=f"https://s.jina.ai/{url_encoded_query}",  # Jina Reader search endpoint
        config=graph_config,
        schema=Contractor,
    )
    result = smart_scraper_graph.run()
    print(result)
    return result

async def main(queries):
    results = await asyncio.gather(*(process_query(query) for query in queries))
    return results
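One caveat with that snippet (my observation, not from the original comment): SmartScraperGraph.run() is a blocking call, so gathering the coroutines as written still executes the scrapes one after another. A minimal sketch of getting real concurrency by offloading each blocking call to a worker thread (run_graph is a stand-in for building and running the graph):

```python
import asyncio

def run_graph(query: str) -> str:
    # stand-in for constructing a SmartScraperGraph and calling its blocking .run()
    return f"result for {query}"

async def process_query(query: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so several queries can make progress concurrently
    return await asyncio.to_thread(run_graph, query)

async def main(queries):
    return await asyncio.gather(*(process_query(q) for q in queries))

results = asyncio.run(main(["plumber austin", "electrician dallas"]))
print(results)
```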
Of course this is possible to implement too. I just don't use it because it's not open source and has a rate limit.
Maybe you can implement it? Probably a good option too
Closing this issue, as the pull request was rejected by the repo owners.
It would be interesting if support were added for firecrawl.ai. They also allow you to self-host their service. Firecrawl allows for cleaner crawling; they handle PDFs as well as dynamic websites.