Open kongzii opened 1 month ago
The first difference is that, by default, this scrapes the whole page in a structured way. That can be useful in some cases, but in this form it just adds complexity for the LLM that has to extract from it:
However, it's possible to add

```python
params = {
    "pageOptions": {
        "onlyMainContent": True
    }
}
```

to scrape only the main content; then it's better:
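For reference, a minimal end-to-end sketch of the call (the client setup is assumed from firecrawl-py's README; the API key is a placeholder, and the network call is commented out so the snippet stays self-contained):

```python
# Sketch: restricting the scrape to main content only.
# firecrawl-py is assumed to be installed; the key below is a placeholder.
params = {
    "pageOptions": {
        "onlyMainContent": True,  # drop navigation, footers, ads, etc.
    }
}

# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")  # placeholder key
# result = app.scrape_url(
#     "https://foundation.mozilla.org/en/blog/deepfakes-election-2024/",
#     params,
# )
# print(result["markdown"])  # cleaned page as Markdown
```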
Tried LLM extraction:
```python
params = {
    "pageOptions": {
        "onlyMainContent": True
    },
}
goal = "Deepfakes and the Election of 2024"
if goal:
    params["extractorOptions"] = {
        "mode": "llm-extraction",
        "extractionPrompt": goal,
        "extractionSchema": {
            "type": "object",
            "properties": {
                "information_related_to_prompt": {
                    "type": "string"
                },
            },
            "required": [
                "information_related_to_prompt",
            ]
        }
    }
x = app.scrape_url("https://foundation.mozilla.org/en/blog/deepfakes-election-2024/", params)
```
The output for `information_related_to_prompt` is `Deepfakes Are Getting Personal, Just In Time For Election Season`, which isn't very useful 😄
I think I used LLM extraction wrongly; it should be used to extract specific pieces of information from the website:
```python
from pprint import pprint

params = {
    "pageOptions": {
        "onlyMainContent": True
    },
}
goal = "Extract relevant information according to the required schema. If the information isn't available, write 'N/A' in the field."
if goal:
    params["extractorOptions"] = {
        "mode": "llm-extraction",
        "extractionPrompt": goal,
        "extractionSchema": {
            "type": "object",
            "properties": {
                "article_authors": {
                    "type": "string"
                },
                "election_dates": {
                    "type": "string"
                },
            },
            "required": [
                "article_authors",
                "election_dates",
            ]
        }
    }
x = app.scrape_url("https://foundation.mozilla.org/en/blog/deepfakes-election-2024/", params)
pprint(x['llm_extraction'])
```
prints

```python
{'article_authors': 'Xavier Harding', 'election_dates': '2024'}
```
which is nice; however, this is just a single OpenAI call that we could make ourselves, yet here it costs 50 credits (instead of 1 for plain scraping).
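To make the comparison concrete, the same extraction could be done on top of a plain 1-credit scrape with one ordinary chat-completion call of our own. A hedged sketch (`build_extraction_messages` is a hypothetical helper, not part of either API; the actual completion call is commented out):

```python
import json


def build_extraction_messages(goal: str, schema: dict, page_markdown: str) -> list:
    """Hypothetical helper: build chat messages that replicate what
    Firecrawl's llm-extraction mode asks of the model."""
    system = (
        "Extract information from the page below.\n"
        f"Goal: {goal}\n"
        "Respond with JSON matching this schema:\n"
        + json.dumps(schema)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": page_markdown},
    ]


messages = build_extraction_messages(
    goal="Extract relevant information according to the required schema.",
    schema={"type": "object", "properties": {"article_authors": {"type": "string"}}},
    page_markdown="# Deepfakes Are Getting Personal ...",  # from the 1-credit scrape
)

# These messages would then go to any chat-completion API, e.g. with openai:
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=messages,
#     response_format={"type": "json_object"},
# )
```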
We can also specify which subpages to include/exclude and limit the crawl depth:

```python
"crawlerOptions": {
    "includes": ["/blog/*", "/products/*"],
    "maxDepth": 3,
    "mode": "fast",
}
```
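Putting the pieces together, these crawler options would sit next to `pageOptions` in the same params dict (the `crawl_url` call is commented out and shown only as an assumption about firecrawl-py's client interface):

```python
# Sketch: full params dict combining page and crawler options.
# The includes/maxDepth/mode values are the examples from above.
params = {
    "pageOptions": {"onlyMainContent": True},
    "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],  # only follow these paths
        "maxDepth": 3,   # stop three links deep from the start URL
        "mode": "fast",  # trade some accuracy for crawl speed
    },
}

# With firecrawl-py this would go to the crawl endpoint, e.g.:
# job = app.crawl_url("https://foundation.mozilla.org", params)
```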
I'm not sure in the end; perhaps it would be nice as a tool for general agents, or we can give it a try if we run into scraping problems in the future. The LLM side of it doesn't seem worth it to me, but the automatic crawling, the fact that it should also handle JavaScript-rendered pages, and the clean Markdown output are all nice.
https://www.firecrawl.dev/