docqai / docq

Private ChatGPT/Perplexity. Securely unlocks knowledge from confidential business information.
https://docqai.github.io/docq/
GNU Affero General Public License v3.0

CORE: Adding data source for web scraping #48

Closed: cwang closed this issue 1 year ago

cwang commented 1 year ago

Is your feature request related to a problem? Please describe.

To support #23, we also need to be able to do web scraping as a data source for a space.

Describe the solution you'd like

Use something like https://llama-hub-ui.vercel.app/l/web-beautiful_soup_web to scrape a designated website (or part of one).

Describe alternatives you've considered

Could hand-build something using BeautifulSoup too, I guess.

There are also a few other scrapers available for LlamaIndex, such as https://llama-hub-ui.vercel.app/l/web-simple_web or https://llama-hub-ui.vercel.app/l/web-async_web
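For reference, wiring up the BeautifulSoup loader would look roughly like this (a minimal sketch based on the llama-hub `download_loader` pattern current at the time; exact names and signatures may differ by llama_index version):

```python
# Sketch only: fetch the BeautifulSoupWebReader loader from llama-hub
# and load a couple of pages as llama_index documents.
from llama_index import download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
documents = loader.load_data(
    urls=["https://docqai.github.io/docq/"]  # example URL, not a real config
)
```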

Additional context

Utilising the existing space data source framework internally.

janaka commented 1 year ago

@cwang what should we actually call this data source?

cwang commented 1 year ago

For the class name it could be WebScraper, and for the user-friendly name, "Web Scraper"? I think the people setting this up would be savvy enough to understand it.

janaka commented 1 year ago

"Web Crawler" is the option I can think of assuming we want to actually provide crawling rather than single page.

cwang commented 1 year ago

Probably a good name 20 years ago in the dot-com boom :) Interestingly, nobody talks about crawling anymore.

janaka commented 1 year ago

It's what happens though :) crawl the links and scrape the content. The devs are too young to know lol
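For what it's worth, the two halves look roughly like this. A hand-rolled sketch with requests + BeautifulSoup that stays on the starting domain; robots.txt, retries, and politeness delays are deliberately omitted:

```python
# Minimal crawl-then-scrape sketch: follow same-domain links (crawl)
# and extract visible text from each page (scrape).
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, max_pages: int = 50) -> dict:
    """Return a {url: extracted_text} mapping for pages reachable from start_url."""
    domain = urlparse(start_url).netloc
    to_visit, seen, pages = [start_url], set(), {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        pages[url] = soup.get_text(separator="\n", strip=True)  # scrape
        for link in soup.find_all("a", href=True):  # crawl
            next_url = urljoin(url, link["href"]).split("#")[0]
            if urlparse(next_url).netloc == domain:
                to_visit.append(next_url)
    return pages
```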

janaka commented 1 year ago

Space : website - 1:1 or 1:many?

Should each web page on a site be a llama_index.Document, or the whole website one llama_index.Document? The latter is how the web loaders in the hub work.

I'm sure the default behaviour would be sufficient for now. That is:

Space data source can be configured with one or more website URLs. Each website is indexed as a single document (with reference links to each page embedded). The space document list view shows the list of websites, not web pages, that have been indexed.

I'm unsure if there's a benefit to indexing each page as a document. Maybe down the road it's an indexing optimisation for reindexing, so we can reindex only the pages that have changed.
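If we did go one-document-per-page, the shape would be something like the sketch below. `pages_to_documents` is a hypothetical helper, and `extra_info` was the Document metadata field at the time of writing (newer llama_index versions call it `metadata`):

```python
# Sketch: one llama_index Document per page, carrying its source URL
# so answers can link back to the page they came from.
from llama_index import Document


def pages_to_documents(pages: dict) -> list:
    """Convert a {url: text} mapping into one Document per page."""
    return [
        Document(text=text, extra_info={"source_url": url})  # hypothetical key
        for url, text in pages.items()
    ]
```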

What do you think?

cwang commented 1 year ago

I'd prefer one doc per web page, for the use case of search and reference back, unless there's a way to link back without treating them as individual docs. I'd also prefer one URL per space, with an optional regex to limit the scraper so it doesn't go beyond scope.
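Something like this for the scope filter (hypothetical config names, just to illustrate the regex idea):

```python
import re

# Hypothetical space config: one root URL plus an optional scope regex.
ROOT_URL = "https://docs.example.com/"
SCOPE_RE = re.compile(r"^https://docs\.example\.com/guides/")


def in_scope(url: str) -> bool:
    # Only follow links under the configured root that match the regex.
    return url.startswith(ROOT_URL) and bool(SCOPE_RE.match(url))
```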

janaka commented 1 year ago

This is how I originally thought it should work. The fact that they haven't done it like this makes me wonder, and it's all three of the loaders that behave this way.

There seems to be an option that includes the page URL as a reference, like I said, but I'm not sure how that works in the responses. I'll check it out.

cwang commented 1 year ago

I'm not too concerned about how docs are indexed, but the minimal requirement should be that we can link back to the relevant web pages.