Open kongzii opened 1 month ago
The first difference is that, by default, this scrapes the whole page in a structured way. That can be useful in some cases, but in this form it just adds complexity for the LLM that has to extract from it:
However, it's possible to add

```python
params = {
    "pageOptions": {
        "onlyMainContent": True
    }
}
```

to scrape only the main content; then it's better:
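For reference, a minimal end-to-end sketch of the call (the client setup is assumed from firecrawl-py's README; the API key is a placeholder, and the network call is commented out so the snippet stays self-contained):

```python
# Sketch: restricting the scrape to main content only.
# firecrawl-py is assumed to be installed; the key below is a placeholder.
params = {
    "pageOptions": {
        "onlyMainContent": True,  # drop navigation, footers, ads, etc.
    }
}

# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")  # placeholder key
# result = app.scrape_url(
#     "https://foundation.mozilla.org/en/blog/deepfakes-election-2024/",
#     params,
# )
# print(result["markdown"])  # cleaned page as Markdown
```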
Tried LLM extraction:
```python
params = {
    "pageOptions": {
        "onlyMainContent": True
    },
}
goal = "Deepfakes and the Election of 2024"
if goal:
    params["extractorOptions"] = {
        "mode": "llm-extraction",
        "extractionPrompt": goal,
        "extractionSchema": {
            "type": "object",
            "properties": {
                "information_related_to_prompt": {
                    "type": "string"
                },
            },
            "required": [
                "information_related_to_prompt",
            ]
        }
    }
x = app.scrape_url("https://foundation.mozilla.org/en/blog/deepfakes-election-2024/", params)
```
The output for `information_related_to_prompt` is `Deepfakes Are Getting Personal, Just In Time For Election Season`, which isn't very useful 😄
I think I used LLM extraction wrongly; it should be used to extract specific pieces of information from the website:
```python
from pprint import pprint

params = {
    "pageOptions": {
        "onlyMainContent": True
    },
}
goal = "Extract relevant information according to the required schema. If the information isn't available, write 'N/A' in the field."
if goal:
    params["extractorOptions"] = {
        "mode": "llm-extraction",
        "extractionPrompt": goal,
        "extractionSchema": {
            "type": "object",
            "properties": {
                "article_authors": {
                    "type": "string"
                },
                "election_dates": {
                    "type": "string"
                },
            },
            "required": [
                "article_authors",
                "election_dates",
            ]
        }
    }
x = app.scrape_url("https://foundation.mozilla.org/en/blog/deepfakes-election-2024/", params)
pprint(x['llm_extraction'])
```
prints

```python
{'article_authors': 'Xavier Harding', 'election_dates': '2024'}
```
which is nice; however, this is just a single OpenAI call that we could make ourselves, yet here it costs 50 credits (instead of 1 for plain scraping).
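To make the comparison concrete, the same extraction could be done on top of a plain 1-credit scrape with one ordinary chat-completion call of our own. A hedged sketch (`build_extraction_messages` is a hypothetical helper, not part of either API; the actual completion call is commented out):

```python
import json


def build_extraction_messages(goal: str, schema: dict, page_markdown: str) -> list:
    """Hypothetical helper: build chat messages that replicate what
    Firecrawl's llm-extraction mode asks of the model."""
    system = (
        "Extract information from the page below.\n"
        f"Goal: {goal}\n"
        "Respond with JSON matching this schema:\n"
        + json.dumps(schema)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": page_markdown},
    ]


messages = build_extraction_messages(
    goal="Extract relevant information according to the required schema.",
    schema={"type": "object", "properties": {"article_authors": {"type": "string"}}},
    page_markdown="# Deepfakes Are Getting Personal ...",  # from the 1-credit scrape
)

# These messages would then go to any chat-completion API, e.g. with openai:
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=messages,
#     response_format={"type": "json_object"},
# )
```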
We can also specify which subpages to include/exclude and limit the crawl depth:

```python
"crawlerOptions": {
    "includes": ["/blog/*", "/products/*"],
    "maxDepth": 3,
    "mode": "fast",
}
```
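Putting the pieces together, these crawler options would sit next to `pageOptions` in the same params dict (the `crawl_url` call is commented out and shown only as an assumption about firecrawl-py's client interface):

```python
# Sketch: full params dict combining page and crawler options.
# The includes/maxDepth/mode values are the examples from above.
params = {
    "pageOptions": {"onlyMainContent": True},
    "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],  # only follow these paths
        "maxDepth": 3,   # stop three links deep from the start URL
        "mode": "fast",  # trade some accuracy for crawl speed
    },
}

# With firecrawl-py this would go to the crawl endpoint, e.g.:
# job = app.crawl_url("https://foundation.mozilla.org", params)
```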
I'm not sure in the end; perhaps it would be nice as a tool for general agents, or we can give it a try if we run into scraping problems in the future. The LLM side of it doesn't seem worth it to me, but the automatic crawling, the fact that it should also handle JavaScript-rendered pages, and the clean Markdown output are all nice.
https://www.firecrawl.dev/