Closed. ecsbeats closed this issue 1 year ago.
@ecsbeats What kind of web content? Can you add more description to this?
@ErKiran TL;DR: The web scraping component should be versatile enough to extract text and any available metadata from any webpage and return them in a structured format.
The purpose of the Colly component is to extract text content, links, and metadata from each page, building a corpus of data about a query that our Query component will search through and distill with language models. The general flow is as follows:
Types of content that Colly will scrape include:
You can use your judgment about the return format. For example, you might cache the scraped content in files and link to those files in your response to the Query API, or you might return everything in a JSON object. We want the implementation details (such as caching) abstracted away from the Query module; beyond that, you can take creative liberties.
I hope this helps. Thank you for taking the time to contribute.
@ErKiran Would you like me to assign this issue to you?
Since I haven't heard back, I'll take this issue.
We are implementing Colly to get and parse text from web content. This will be implemented in `query/search_clients/colly.go`. For unit tests (in `query/search_clients/colly_test.go`), use `httptest` and mock web endpoints rather than mocking individual methods.