@cwang what should we actually call this data source?
For the class name it could be WebScraper
and for the user-friendly name it could be "Web Scraper"? I think the people setting this up would be savvy enough to understand it.
"Web Crawler" is the option I can think of assuming we want to actually provide crawling rather than single page.
Probably a good name 20 years ago in the dotcom boom :) Interestingly nobody talks about crawling anymore
it's what happens though :) crawl links and scrape the content. The devs are too young to know lol
Space : website - 1:1 or 1:many?
Each web page on a site = a llama_index.Document, or the whole website = a llama_index.Document? The latter is how the web loaders in the hub work.
I'm sure the default behaviour would be sufficient for now. That is:
A space data source can be configured with one or more website URLs. Each website is indexed as a single document (with reference links to each page embedded). The space document list view shows the list of websites, not web pages, that have been indexed.
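As a rough illustration, the default could look something like the sketch below. `build_site_document` and the `pages` mapping are hypothetical names (the actual crawl step would live elsewhere), and `Document(text=..., extra_info=...)` assumes the llama_index API of the time:

```python
from llama_index import Document

def build_site_document(site_url: str, pages: dict[str, str]) -> Document:
    """Hypothetical helper: fold every crawled page of one website into a
    single Document, embedding each page's URL as a reference link."""
    parts = [f"[source: {url}]\n{text}" for url, text in pages.items()]
    return Document(text="\n\n".join(parts), extra_info={"website": site_url})
```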
I'm unsure if there's a benefit to indexing each page as its own document. Maybe down the road it's an indexing optimisation, so we can reindex only the pages that have changed.
What do you think?
I'd prefer one doc per webpage, for the use case of search and linking back, unless there's a way to link back without treating pages as individual docs. I'd also prefer one URL per space, with an optional regex to filter/limit the scraper so it doesn't go beyond scope.
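Something like this hedged sketch: `scrape_space` is a hypothetical name, `requests`/`BeautifulSoup` stand in for whatever fetching the real loader does, and the regex check is the scope limit I mean.

```python
import re
from typing import Optional

import requests
from bs4 import BeautifulSoup
from llama_index import Document

def scrape_space(start_url: str, scope: Optional[str] = None) -> list[Document]:
    """Hypothetical helper: one Document per page, following only links
    that match the optional `scope` regex."""
    pattern = re.compile(scope) if scope else None
    seen, queue, docs = set(), [start_url], []
    while queue:
        url = queue.pop()
        if url in seen or (pattern and not pattern.search(url)):
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # Keep the page URL so responses can link back to the source page.
        docs.append(Document(text=soup.get_text(), extra_info={"source": url}))
        queue.extend(a["href"] for a in soup.find_all("a", href=True)
                     if a["href"].startswith("http"))
    return docs
```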
This is how I originally thought it should work. The fact that they haven't done it like this makes me wonder, and it's all three of them.
There seems to be an option that includes the page URL as a reference, like I said, but I'm not sure how that works in the responses. I'll check it out.
Not too concerned about how docs are indexed, but the minimal requirement should be that we can link back to the relevant webpages.
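For reference, here's roughly how a stored URL could surface at query time. This assumes the ~0.6-era llama_index API (attribute names like `extra_info` were renamed in later releases), and `documents` is whatever a loader produced:

```python
from llama_index import GPTVectorStoreIndex

# `documents` each carry their page URL in extra_info (see sketches above).
index = GPTVectorStoreIndex.from_documents(documents)
response = index.as_query_engine().query("How do I configure a space?")
for source in response.source_nodes:
    # Link back to the webpage the answer was drawn from.
    print(source.node.extra_info.get("source"))
```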
Is your feature request related to a problem? Please describe.
To support #23 we also need to be able to do web scraping for a space as a data source.
Describe the solution you'd like
Use something like https://llama-hub-ui.vercel.app/l/web-beautiful_soup_web to scrape a designated website (or part of one).
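A minimal sketch of wiring that loader up, assuming the standard LlamaHub download_loader pattern (the URL is a placeholder):

```python
from llama_index import download_loader

# Fetch the loader implementation from LlamaHub at runtime.
BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
# One Document per URL; the source URL is recorded in the document's extra_info.
documents = loader.load_data(urls=["https://docs.example.com/"])
```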
Describe alternatives you've considered
Could hand-build something using BeautifulSoup too, I guess.
There are also a few other scrapers available for LlamaIndex, such as https://llama-hub-ui.vercel.app/l/web-simple_web or https://llama-hub-ui.vercel.app/l/web-async_web
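For comparison, the simple_web loader follows the same pattern (again a hedged sketch with a placeholder URL):

```python
from llama_index import download_loader

SimpleWebPageReader = download_loader("SimpleWebPageReader")
# html_to_text=True converts the fetched HTML to plain text
# (requires the html2text package).
documents = SimpleWebPageReader(html_to_text=True).load_data(
    urls=["https://docs.example.com/"]
)
```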
Additional context
Utilising the existing space data source framework internally.