ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI
https://scrapegraphai.com
MIT License

Support for firecrawl #493

Closed: AmosDinh closed this issue 1 week ago

AmosDinh commented 1 month ago

It would be interesting if support was added for firecrawl.ai. They also let you self-host their service. Firecrawl allows for cleaner crawling, and it handles PDFs as well as dynamic websites.
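For context, a minimal sketch of what fetching a page with the standalone firecrawl-py client looks like (the URL is a placeholder, and exact parameter names may differ between client versions):

from firecrawl import FirecrawlApp

# api_url points at a self-hosted instance; omit it to use the hosted API.
app = FirecrawlApp(api_key="YOUR_API_KEY", api_url="http://localhost:3002")

# Returns the page as cleaned-up content (e.g. markdown).
result = app.scrape_url("https://example.com")
print(result)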

AmosDinh commented 1 month ago

I tested it, and it is a lot faster than the standard chromium.py loader. You can check for yourself here: https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/pre/beta...AmosDinh:Scrapegraph-ai:pre/beta

You need to add a .env file like this (if self-hosting):

FIRECRAWL_API_URL=http://localhost:3002
FIRECRAWL_API_KEY=YOUR_API_KEY_NOT_NEEDED

Or, if you want to quickly test with the free tier of Firecrawl:

FIRECRAWL_API_KEY=YOUR_API_KEY
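If the variables aren't exported by your shell, you can load the .env file yourself with python-dotenv; this is a minimal sketch, assuming the integration reads the two variables from the environment:

import os
from dotenv import load_dotenv

# Load FIRECRAWL_API_URL / FIRECRAWL_API_KEY from the local .env file.
load_dotenv()

print(os.environ.get("FIRECRAWL_API_URL"))
print(os.environ.get("FIRECRAWL_API_KEY"))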

You need to add this to the scrapegraphai config: "loader_kwargs": {"scraping_backend": "firecrawl"},

A future option would be to add the arguments firecrawl_api_key and firecrawl_api_url to loader_kwargs. The LangChain class unfortunately does not support passing the URL as an argument yet; I have submitted a pull request.
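A sketch of what that could look like (hypothetical; the two extra keys are not implemented yet):

"loader_kwargs": {
    "scraping_backend": "firecrawl",
    "firecrawl_api_key": "YOUR_API_KEY",           # hypothetical, not yet supported
    "firecrawl_api_url": "http://localhost:3002",  # hypothetical, not yet supported
},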

For example, a full run with the current interface:

from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3.1",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "max_results": 5,
    "verbose": True,
    "headless": True,
    "loader_kwargs": {
        "scraping_backend": "firecrawl",  # route page fetching through Firecrawl
    },
}

search_graph = SearchGraph(
    prompt="Extract information regarding Macbook Pro m1 versions (year, price, specs, etc.). Be specific with the versions and make sure to include all. List every model configuration separately",
    config=graph_config
)

result = search_graph.run()
print(result)

AmosDinh commented 1 month ago

Not sure if you want to add firecrawl-py to requirements.txt, and the docs could be updated. Are you interested in this?

VinciGit00 commented 1 month ago

Yes, we could be interested.

AmosDinh commented 1 month ago

Ok, great, although I won't be able to integrate everything; I don't have a full overview of the features in Firecrawl or ScrapeGraphAI. So, for example, besides the changes I made in my branch, you could integrate the PDF extraction, etc.

Generally, this should be beneficial to anyone, since it is a lot faster to crawl with their method. I think they use Playwright internally as well, but I am not sure why it is faster.

f-aguzzi commented 1 month ago

If you integrated even a small part of it and could make a pull request, it would be greatly appreciated. We're getting more and more feature requests along with fewer and fewer contributions, so anyone willing to do their homework is automatically our hero :)

f-aguzzi commented 1 month ago

Also, we use Rye as our Python project configuration tool. If you want to add Firecrawl as a dependency, either add it to the pyproject.toml file at the root of the project or do it through the Rye CLI (e.g. rye add firecrawl-py followed by rye sync). requirements.txt is built from pyproject.toml using a script. Sorry that this isn't specified anywhere; we need to update the contributing guidelines.

angelotc commented 4 weeks ago

Why not just use Jina? It's free and easier to use.

Here's an example:

import asyncio
import urllib.parse

from scrapegraphai.graphs import SmartScraperGraph

# graph_config and the Contractor schema are assumed to be defined elsewhere.

async def process_query(query):
    url_encoded_query = urllib.parse.quote(query)
    print(url_encoded_query)
    smart_scraper_graph = SmartScraperGraph(
        prompt="Find the yelp link, name, website, number of average yelp reviews, summary of yelp_reviews, specialties, phone, and their website",
        source=f"https://s.jina.ai/{url_encoded_query}",  # Jina Reader search endpoint
        config=graph_config,
        schema=Contractor,
    )

    # run() is synchronous, so off-load it to a thread to keep the event loop free.
    result = await asyncio.to_thread(smart_scraper_graph.run)
    print(result)

    return result

async def main(queries):
    results = await asyncio.gather(*(process_query(query) for query in queries))
    return results
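
And a minimal entry point to run it (the queries are placeholders):

if __name__ == "__main__":
    queries = ["plumber san francisco", "electrician oakland"]  # placeholder queries
    print(asyncio.run(main(queries)))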
AmosDinh commented 3 weeks ago

Of course, this is possible to implement too. I just don't use it because it's not open source and it has a rate limit.

Maybe you can implement it? It's probably a good option too.

f-aguzzi commented 1 week ago

Closing this issue, as the pull request was rejected by the repo owners.