Query API - Implement Colly Web Scraping

ecsbeats commented 1 year ago

We are adding Colly to fetch web pages and parse their text. The implementation will live in query/search_clients/colly.go. For the unit tests (query/search_clients/colly_test.go), use httptest to serve mock web endpoints rather than mocking individual methods.
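To make the testing approach concrete, here is a rough sketch of a mock web endpoint built with httptest. It uses only the standard library so it runs on its own. The names newMockSite and fetchBody are hypothetical: fetchBody is a plain net/http placeholder for whatever Colly-backed function ends up in colly.go, and in colly_test.go the body of main would sit inside a func TestXxx(t *testing.T) instead.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newMockSite starts a local HTTP server that always returns the given HTML.
// (Hypothetical helper; the caller must Close the returned server.)
func newMockSite(html string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprint(w, html)
	}))
}

// fetchBody is a placeholder for the Colly-backed client (assumption):
// it downloads the page at url and returns the raw body.
func fetchBody(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	const page = `<html><head><title>Mock</title></head><body><p>hello</p></body></html>`

	srv := newMockSite(page)
	defer srv.Close()

	// The code under test talks to srv.URL exactly as it would to a real site.
	body, err := fetchBody(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(body == page) // prints "true"
}
```

The point of this shape is that the scraper is exercised end to end over real HTTP, so the tests keep passing even if the internals of the client change.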

ErKiran commented 1 year ago

@ecsbeats What kind of web content? Can you add more description to this?

ecsbeats commented 1 year ago

@ErKiran TLDR: The web scraping component should be versatile enough to extract text and (provided) metadata from any webpage and return it in a structured format.

Purpose and Where it Fits

The purpose of the Colly component is to extract text content, links, and metadata from each page, building a corpus of data about a query that our Query component will search through and distill with language models. The general flow is as follows:

Language model calls the Query API
Query API uses Google Search's API to get search results
Query API calls Colly to scrape the links of the search results
Colly retrieves and parses each web page's content, extracting page content, links, and metadata (stylesheets, SEO tags, etc.)
Colly serializes and returns this data to the Query API
Query API does AI data distillation on the data and returns it to the user (see #5)

We only need a scraping depth of one for the MVP, though as our data distillation process improves, this is subject to change.
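As a rough illustration of the scraping step at depth one, the sketch below visits each search-result URL once and records the links it finds without following them. Everything here is an assumption made for illustration: Page, Scraper, ScrapeResults, and fakeScraper are hypothetical names, and a Colly-backed type would take the place of fakeScraper.

```go
package main

import "fmt"

// Page is a hypothetical container for one scraped page.
type Page struct {
	URL   string
	Text  string
	Links []string
}

// Scraper is a hypothetical abstraction over the scraping backend;
// a Colly-backed type would implement it.
type Scraper interface {
	Scrape(url string) (Page, error)
}

// ScrapeResults scrapes every search-result URL once (depth one).
// Links discovered on a page are kept in Page.Links but never visited.
// Duplicate URLs are skipped; failures are collected instead of aborting.
func ScrapeResults(s Scraper, urls []string) ([]Page, []error) {
	seen := make(map[string]bool)
	var pages []Page
	var errs []error
	for _, u := range urls {
		if seen[u] {
			continue
		}
		seen[u] = true
		p, err := s.Scrape(u)
		if err != nil {
			errs = append(errs, fmt.Errorf("scrape %s: %w", u, err))
			continue
		}
		pages = append(pages, p)
	}
	return pages, errs
}

// fakeScraper stands in for the real backend and counts its calls.
type fakeScraper struct{ calls int }

func (f *fakeScraper) Scrape(url string) (Page, error) {
	f.calls++
	return Page{URL: url, Text: "stub", Links: []string{url + "/next"}}, nil
}

func main() {
	f := &fakeScraper{}
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/a"}
	pages, errs := ScrapeResults(f, urls)
	fmt.Println(len(pages), len(errs), f.calls) // prints "2 0 2"
}
```

If the scraping depth grows later, only ScrapeResults would need to change; the Query API would still see a slice of pages.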

Example Content

Types of content that Colly will scrape include:

Library documentation (ex. Flask Documentation)
API documentation (ex. Finnhub)
API specifications (ex. Wikipedia's OpenAPI Specification)
News Articles (ex. CNBC)
General Articles (ex. Medium)
Informational Content (ex. College Board)

Return Format

Use your judgment on the return format. For example, you might cache the scraped content in files and link to those files in your response to the Query API, or you might return everything in a single JSON object. We want implementation details (such as caching) abstracted away from the Query module, but beyond that, you can take creative liberties.
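For the JSON option, one possible shape is sketched below. This is only an illustration, not a required schema: the ScrapedPage type, its field names, and EncodePages are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ScrapedPage is a hypothetical response record for one page.
// Field names and JSON keys are illustrative, not a fixed contract.
type ScrapedPage struct {
	URL      string            `json:"url"`
	Title    string            `json:"title"`
	Text     string            `json:"text"`
	Links    []string          `json:"links"`
	Metadata map[string]string `json:"metadata"` // e.g. SEO tags
}

// EncodePages serializes the scraped pages into a single JSON array.
func EncodePages(pages []ScrapedPage) (string, error) {
	b, err := json.Marshal(pages)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := EncodePages([]ScrapedPage{{
		URL:      "https://example.com",
		Title:    "Example",
		Text:     "Hello",
		Links:    []string{"https://example.com/docs"},
		Metadata: map[string]string{"description": "demo"},
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Whether the pages come from a fresh scrape or from a cache would be decided behind whatever function produces the []ScrapedPage, so the Query module never needs to know.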

I hope this helps. Thank you for taking the time to contribute.

ecsbeats commented 1 year ago

@ErKiran Would you like me to assign this issue to you?

ecsbeats commented 1 year ago

Since I haven't heard back, I'll take this issue.
