Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://useanything.com
MIT License

[FEAT]: Bulk Link scraper improvement suggestions #1570

Open wallartup opened 3 weeks ago

wallartup commented 3 weeks ago

How are you running AnythingLLM?

Docker (local)

What happened?

First of all, I love the idea of recursively scraping a lot of content via a bulk link scraper.

I think it needs to be rethought a little. I would recommend not giving people the option to choose the number of links to scrape, but rather doing a full recursive scrape of the whole website. The reason is that you already have the ability to choose which links you want to scrape IF you go one by one.

I would argue that nobody knows how many links they want to scrape from a full website (because you don't control what the scraper finds), so the number is meaningless: when you are scraping, you want everything on that page, subdomain, or folder.

Explanation:

  1. Many times you have a use case where the whole website needs to be scraped. You would usually not scrape 10 or 50 pages (without knowing which pages have been scraped), but rather everything the scraper finds on that URL. Make a recursive scraper that takes everything and remove the option to set a number of links (see the sketch after this list).
  2. While scraping, we only get the popup saying "scraping, this might take some time". I would rather make this a brand new menu entry in the backend called "Scraped websites", where you can see how many pages have been scraped and stop, pause, or resume the scraping.
  3. These files do not need to land in the documents folder; they should go into a separate folder and be EMBEDDED automatically.
  4. Make scraped websites available via a "Scraped websites" entry in the settings menu. Add a three-dots menu similar to the one for adding users to a workspace; that way you can attach different scraped websites to different workspaces outside of the document management solution.
  5. In the data connectors bulk links menu, add a link to the admin pages where you can see the scrapes that are in progress, done, and/or paused.
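
To make the "scrape everything" idea concrete, here is a minimal sketch of a breadth-first, same-origin crawl that follows every link it finds instead of asking the user for a link count. This is not AnythingLLM's collector code; `crawlSite` and `CrawledPage` are hypothetical names, the regex-based link extraction is only for illustration (a real crawler would use an HTML parser), and it assumes Node 18+ for the global fetch API.

```typescript
type CrawledPage = { url: string; html: string };

async function crawlSite(rootUrl: string, signal?: AbortSignal): Promise<CrawledPage[]> {
  const origin = new URL(rootUrl).origin;
  const seen = new Set<string>([rootUrl]);
  const queue: string[] = [rootUrl];
  const pages: CrawledPage[] = [];

  while (queue.length > 0) {
    if (signal?.aborted) break; // lets the caller stop the crawl at any point
    const url = queue.shift()!;

    let html: string;
    try {
      const res = await fetch(url, { signal });
      const type = res.headers.get("content-type") ?? "";
      if (!res.ok || !type.includes("text/html")) continue;
      html = await res.text();
    } catch {
      continue; // skip unreachable pages instead of aborting the whole crawl
    }
    pages.push({ url, html });

    // Collect every same-origin link we have not seen yet.
    for (const match of html.matchAll(/href\s*=\s*["']([^"'#]+)["']/gi)) {
      let next: URL;
      try {
        next = new URL(match[1], url);
      } catch {
        continue;
      }
      next.hash = "";
      const link = next.toString();
      if (next.origin === origin && !seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return pages;
}
```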

I tried adding 50,000 links to scrape and now have no idea how to shut it down :) haha. It even said the scrape failed and then it kept scraping in the background. This is why the above is key.
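
For the stop/pause part, one possible approach (again just a sketch, not how the collector currently works) is to keep one AbortController per crawl job on the backend and expose its abort() through the proposed "Scraped websites" page, reusing the `crawlSite` sketch above:

```typescript
// Hypothetical wiring: a "Stop" button in the "Scraped websites" page calls
// abort(), so a runaway 50,000-page crawl can actually be terminated.
const controller = new AbortController();
const job = crawlSite("https://example.com", controller.signal);

// Here we simply abort after one minute to demonstrate cancellation.
setTimeout(() => controller.abort(), 60_000);

job.then((pages) => console.log(`Crawl stopped after ${pages.length} pages`));
```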

Are there known steps to reproduce?

No response

SupercaliG commented 3 weeks ago

I second this!