c2siorg / Project-Explainer

A set of tools to explain GitHub repositories using large language models
https://huggingface.co/spaces/SriPravallikaB/projectexplainer
Apache License 2.0
16 stars 15 forks

feat: Scraping Documentation from Websites for Building Knowledge Graphs #42

Open debrupf2946 opened 4 weeks ago

debrupf2946 commented 4 weeks ago

Implement Python Script for Scraping Documentation into Llama-Index Document Objects

Description

This issue invites contributors to develop a Python script that scrapes documentation from websites and stores each scraped page as a Llama-Index document object.

Objective

Requirements

  1. Scraping the Documentation:

    • Utilize Python libraries like Beautiful Soup, Scrapy, or Requests-HTML to scrape content from the main documentation page and all associated sublinks.
    • Ensure accurate extraction of relevant content, including text, code snippets, and descriptions.
  2. Llama-Index Document Object Creation:

    • Store the data scraped from each individual link in a separate Llama-Index document object.
    • Attach metadata to each document object that records the URL of the link from which the content was scraped.
    • Compile all individual document objects into a list, representing the complete Llama-Index.
  3. Documentation:

    • Document the script clearly, providing instructions on how to use it.
    • Provide a notebook implementation that explores various scraping strategies and compares their responses; this part is mostly research.
    • Later we can build a module out of it.
  4. Error Handling:

    • Implement robust error handling to manage issues such as broken links, failed requests, or unexpected data formats.
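
The requirements above could be sketched roughly as follows. This is only a minimal illustration, not the project's implementation: it uses the standard library (`urllib`, `html.parser`) instead of Beautiful Soup or Scrapy so that it runs without extra installs, and `Doc`, `fetch_html`, and `scrape_docs` are hypothetical names introduced here. `Doc` is a stand-in for a Llama-Index document object, holding text plus a `url` metadata entry.

```python
import logging
from collections import deque
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


@dataclass
class Doc:
    """Hypothetical stand-in for a Llama-Index document (text plus metadata)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


class _PageParser(HTMLParser):
    """Collects visible text (including code in <pre>) and href targets."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self.links: List[str] = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page; raises OSError (e.g. URLError) on failure."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_docs(start_url: str,
                fetch: Callable[[str], str] = fetch_html,
                max_pages: int = 50) -> List[Doc]:
    """Breadth-first crawl of start_url and its sublinks, one Doc per page."""
    root = urlparse(start_url)
    queue, seen = deque([start_url]), {start_url}
    docs: List[Doc] = []
    attempts = 0
    while queue and attempts < max_pages:
        url = queue.popleft()
        attempts += 1
        try:
            html = fetch(url)
        except (OSError, ValueError) as exc:  # broken link, failed request
            logging.warning("skipping %s: %s", url, exc)
            continue
        parser = _PageParser()
        parser.feed(html)
        text = "\n".join(parser.chunks)
        if text:
            docs.append(Doc(text=text, metadata={"url": url}))
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            target = urlparse(link)
            # Stay on the same host and under the starting path.
            if (target.netloc == root.netloc
                    and target.path.startswith(root.path)
                    and link not in seen):
                seen.add(link)
                queue.append(link)
    return docs
```

Calling `scrape_docs("https://docs.example.com/guide/")` (a placeholder URL) would return a list with one `Doc` per reachable page. To produce real Llama-Index objects, `Doc(text=..., metadata=...)` could be swapped for `Document(text=..., metadata=...)`, assuming a recent llama-index release where `Document` is importable from `llama_index.core`. The `fetch` parameter is injectable so that the crawl logic can be tried in a notebook against canned HTML before hitting a live site.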

Submission Guidelines

Resources

Submission Checklist

We look forward to your valuable contributions that will enhance our capability to integrate website documentation into our knowledge systems!

SarangShelke2304 commented 1 week ago

Hi, is anyone working on this? Can I take it?