Sstobo / Site-Sn33k

A collection of tools to rip webpages and clean them for Pinecone.

Website Scraper, And PDF Chunker/Vectorizer for Pinecone DB

This Python repository contains a set of scripts that scrape a website, clean the data, organize it, chunk it, and then vectorize it. The resulting vectors can be used for machine learning tasks such as similarity search or clustering. A recently added script also consumes PDFs and adds them to the training data.

Files

THESE FUNCTIONS CONSUME THE FILES THEY PROCESS (only in the websites and pdfs directories)
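
Because the scripts consume their inputs, it may be worth copying the websites and pdfs directories somewhere safe before running anything. The sketch below is not part of this repository: backup_inputs and the backup directory name are hypothetical, and it assumes "consume" means the originals are removed or overwritten during processing.

```python
import shutil
from pathlib import Path


def backup_inputs(source_dirs, backup_root):
    """Copy each existing source directory into backup_root and return the copies."""
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    copies = []
    for src in map(Path, source_dirs):
        if not src.is_dir():
            continue  # skip directories that do not exist yet
        dest = backup_root / src.name
        # dirs_exist_ok lets a repeated run refresh an existing backup (Python 3.8+)
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copies.append(dest)
    return copies
```

Calling backup_inputs(["websites", "pdfs"], "backup") from the project root would leave untouched copies under backup/websites and backup/pdfs.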

Requirements

Usage

Step 1: We choose a site to rip.

Step 2: We see our 'website' folder filling up with files.

Step 3: We run the cleaner script.

Step 4: Files are normalized and cleaned up.

Step 5: Now we run chunker, and we can see our website files are now chunked and vectorized.

Step 6: Now we run our PDF muncher, and it will consume the Dungeons and Dragons monster manual PDF in our pdfs folder.

Step 7: Finally, we can see our vectorized training data contains the DnD content as well!

Now we simply run vectorizor and our Pinecone DB will get updated.
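
The walk-through above runs the scripts in a fixed order, and a small wrapper can make that order explicit. This is a sketch, not something shipped with the repository: build_commands and run_pipeline are hypothetical helpers, the script names are the ones used in this README, and it assumes the websites folder has already been filled by the download step.

```python
import subprocess
import sys

# Script names as given in this README, in the order the walk-through runs them.
PIPELINE = ["cleaner.py", "chunker.py", "pdf-muncher.py", "vectorizor.py"]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one [interpreter, script] command per pipeline stage."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True)
```

Calling run_pipeline(build_commands()) from the project root would run all four stages and stop if any of them exits with an error, so a failed clean does not feed bad data to the chunker.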

  1. Clone the repository and navigate to the project directory.

  2. Install the required Python libraries using pip install -r requirements.txt.

  3. Set up your OpenAI and Pinecone API keys.

  4. Download the website. Copy and run the wget command: wget -r -np -nd -A.html,.txt,.tmp -P websites https://www.linkedin.com/in/sean-stobo/

  5. Run python cleaner.py to download and clean the website data. This will break down the directory structure into one list of HTML docs.

  6. Run python chunker.py to split the text files into smaller chunks. This outputs train.json in the root.

  6.5. Run python pdf-muncher.py to convert the contents of the '/pdfs/' folder to a serialized train.jsonl file in the root.

  7. Run python vectorizor.py to create embeddings and index them using Pinecone. This will vectorize train.json.
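
To give a feel for what the chunker.py step does, here is a minimal sketch of splitting text into overlapping chunks and serializing them one JSON object per line. It is not the repository's implementation: the character-based splitting, the default sizes, the function names, and the "source" and "text" field names are all assumptions made for illustration.

```python
import json


def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def to_json_lines(chunks, source):
    """Serialize chunks as JSON Lines, one object per chunk."""
    return "\n".join(json.dumps({"source": source, "text": c}) for c in chunks)
```

The overlap trades a little duplicated storage for better recall at chunk boundaries; setting overlap=0 gives plain fixed-size chunks.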

Note: Before running vectorizor.py, make sure to set up a Pinecone database with 1536 dimensions.
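
1536 is the output size of OpenAI's text-embedding-ada-002 model, which is presumably why the index needs that dimension. Which model vectorizor.py actually calls is not stated here, so treat that as an assumption. A quick check before uploading can catch a mismatch early. check_dimensions below is a hypothetical helper and makes no Pinecone calls.

```python
EXPECTED_DIM = 1536  # index dimension from the note above


def check_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]
```

An empty result means every vector matches the index dimension. Any listed positions point at embeddings that the index would reject.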