
Persist scraped web pages #530

Open damms005 opened 7 months ago

damms005 commented 7 months ago

This is in furtherance of #400

It would be a good addition to not have to re-scrape the same webpage over and over, as doing so is wasteful.

Scraped pages should persist, perhaps in a system-wide cache, such that a subsequent call to /web http://already-scraped.com/specific-page only re-scrapes if the page is not already cached or if the user specifically asks for a refresh, perhaps via a switch on the /web command.
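To illustrate, here's a rough sketch of the behavior I have in mind, not aider's actual internals: the cache directory, the `scrape` callable, and the `force` flag (which a hypothetical `/web --force` switch could set) are all assumptions.

```python
import hashlib
from pathlib import Path

# Hypothetical system-wide cache location for scraped pages.
CACHE_DIR = Path.home() / ".aider" / "web-cache"

def cached_scrape(url: str, scrape, force: bool = False) -> str:
    """Return the page content for `url`, re-scraping only when the
    cache misses or the user forces a refresh (e.g. /web --force)."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    cached = CACHE_DIR / f"{key}.txt"
    if cached.exists() and not force:
        return cached.read_text()
    content = scrape(url)  # whatever scraper aider already uses
    cached.write_text(content)
    return content
```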

Many thanks for the awesome job!

paul-gauthier commented 7 months ago

Thanks for trying aider and filing this issue.

Re-scraping a webpage should only take a moment, and ensures you have a fresh copy of the data it contains. Persisting or caching the content could lead to problems with not picking up new page content.

Can you help me understand the problem you are having with re-scraping?

damms005 commented 7 months ago

Agreed. Although "should only take a moment" adds up when done multiple times a day, especially on a poor connection.

My specific use case is when I need to use specific features of tools/frameworks like Laravel or Filament. I find myself re-scraping the same documentation pages in order to provide context for tasks.

I may also be using the tool wrong, yk 🤷‍♂️

nevercast commented 6 months ago

I wonder if this also fits into the broader RAG feature.

hargup commented 5 months ago

I feel optional caching could definitely help. Most of the pages programmers look at don't change frequently, and an optional cache with a TTL of, say, 7 days might be helpful in speeding up aider.
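A minimal sketch of that freshness check, assuming cached pages are stored as files as in the sketch above (all names are illustrative):

```python
import time
from pathlib import Path

SEVEN_DAYS = 7 * 24 * 60 * 60  # suggested TTL, in seconds

def is_fresh(cached: Path, ttl: float = SEVEN_DAYS) -> bool:
    """True if the cached file exists and was written less than
    `ttl` seconds ago; a miss or a stale file triggers a re-scrape."""
    try:
        age = time.time() - cached.stat().st_mtime
    except FileNotFoundError:
        return False
    return age < ttl
```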

nevercast commented 4 months ago

I wonder if it's worth just respecting the HTTP cache headers from the server; for most servers that'll be sufficient, whether time-based, ETag, or otherwise.
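For what it's worth, a conditional GET along those lines might look like this sketch using `requests` directly (this isn't aider's code; the `etag`/`last_modified` validators would be saved alongside the cached page from the previous response):

```python
import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    """Conditional GET: returns None on 304 (cached copy still valid),
    otherwise the fresh body plus the new validators to store."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # server says our cached copy is still current
    resp.raise_for_status()
    return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

Libraries like requests-cache already implement this kind of header-respecting cache, so it might not even need to be hand-rolled.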