c2siorg / Project-Explainer

Set of tools to explain GitHub repositories using large language models
https://huggingface.co/spaces/SriPravallikaB/projectexplainer
Apache License 2.0
17 stars 16 forks

feat: a standard python module for data preparation #3

Closed sripravallikab closed 1 year ago

sripravallikab commented 1 year ago

The Python module is meant to be part of the data preparation pipeline.

Functionality:

  1. Should be an importable Python module.
  2. Should expose a function that takes a git repo URL as input.
  3. Fetches the relevant data based on the user's intent, e.g. README.md or files in a given location.
  4. Reads the content of those files.
  5. Cleans the content, e.g. converts Markdown to plain text and removes junk; for code files, extracts the code comments and omits the code itself.
  6. Returns each file name together with its corresponding cleaned-up data.
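
A minimal sketch of how the six points above could fit together, using only the standard library. Everything named here (`markdown_to_text`, `extract_python_comments`, `prepare_directory`, `prepare_repo`, the `patterns` parameter) is hypothetical and not taken from the actual module. It also assumes that the `git` CLI is on the PATH, that "user's intent" can be expressed as glob patterns, and that only Python files need comment extraction. The Markdown cleanup is a rough regex pass, not a real Markdown parser.

```python
import fnmatch
import io
import re
import subprocess
import tempfile
import tokenize
from pathlib import Path


def markdown_to_text(markdown: str) -> str:
    """Rough regex-based Markdown stripper (an approximation, not a parser)."""
    text = re.sub(r"```.*?```", "", markdown, flags=re.DOTALL)  # drop fenced code
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)  # images -> alt text
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # links -> link text
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # heading marks
    text = re.sub(r"[*`]", "", text)  # asterisks and backticks
    return re.sub(r"\n{3,}", "\n\n", text).strip()  # collapse blank-line runs


def extract_python_comments(source: str) -> str:
    """Keep only `#` comments from Python source; docstrings are not included."""
    comments = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comments.append(tok.string.lstrip("#").strip())
    except (tokenize.TokenError, SyntaxError):
        pass  # unparsable file: keep whatever comments were seen so far
    return "\n".join(comments)


def prepare_directory(root, patterns=("README.md",)):
    """Map each matching file (path relative to root) to its cleaned text."""
    root = Path(root)
    cleaned = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root).as_posix()
        if not path.is_file() or rel.startswith(".git/"):
            continue
        if not any(fnmatch.fnmatch(rel, pattern) for pattern in patterns):
            continue
        raw = path.read_text(encoding="utf-8", errors="ignore")
        if path.suffix.lower() in {".md", ".markdown"}:
            cleaned[rel] = markdown_to_text(raw)
        elif path.suffix == ".py":
            cleaned[rel] = extract_python_comments(raw)
        else:
            cleaned[rel] = raw.strip()
    return cleaned


def prepare_repo(repo_url, patterns=("README.md",)):
    """Shallow-clone repo_url into a temp dir, then clean the matching files."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, tmp],
            check=True,
            capture_output=True,
        )
        return prepare_directory(tmp, patterns)
```

With this shape, a caller would do something like `prepare_repo(url, patterns=("README.md", "docs/*.md"))` and get back a dict of file name to cleaned text. Splitting `prepare_directory` out from `prepare_repo` keeps the cleaning logic testable without network access.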