BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.59k stars, 1.97k forks

Curate HTML w/ Semantic similarity | JinaAI Embeddings v2 (Small) | Curate HTML to Markdown with JinaAI Embedding Processing for Redundancy Removal #89

Closed: Daethyra closed this 9 months ago

Daethyra commented 10 months ago

This pull request introduces significant enhancements to the conv_html_to_markdown.py module, integrating JinaAI's embeddings v2 (Small) model for advanced text processing. The primary goal is to refine the conversion of HTML content into Markdown format, with an added focus on removing redundant data using semantic analysis. Below is a detailed overview of the workflow, reasoning, steps taken, and the key features of this updated module.

Workflow and Reasoning:

Features Covered:

Task List:

Steps Taken:

  1. Module Setup: Initialized the module with BeautifulSoup and markdownify for HTML parsing and Markdown conversion.
  2. Embedding Integration: Integrated AutoTokenizer and AutoModel from the transformers library for embedding processing.
  3. Batch Processing Logic: Developed the process_embeddings method to handle embeddings in batches, optimizing for large datasets.
  4. Semantic Analysis: Created the remove_redundant_data method to filter out semantically similar content using cosine similarity on embeddings.
  5. Error Handling: Added comprehensive try-except blocks for resilience, particularly in critical methods like convert and curate_content.
  6. Testing and Validation: Conducted thorough testing to ensure the accuracy and efficiency of the embedding-based redundancy removal.
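
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the names process_embeddings and remove_redundant_data come from the description above, while their signatures, the `batched` helper, the default threshold of 0.9, and the `toy_embed` function are assumptions made for the example. `toy_embed` is a hashed bag-of-words stand-in so the sketch runs without downloading a model; in the module described here, the embedding function would presumably wrap the JinaAI model loaded through AutoTokenizer and AutoModel.

```python
import math
import zlib
from typing import Callable, Iterator, List, Sequence

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]

DIM = 32  # assumed size of the toy embedding, chosen arbitrarily


def toy_embed(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a real model: hashed bag-of-words counts."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for word in text.lower().split():
            # crc32 is deterministic across runs, unlike the built-in hash().
            vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
        vectors.append(vec)
    return vectors


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_embeddings(texts: Sequence[str], embed_fn: EmbedFn,
                       batch_size: int = 2) -> List[Vector]:
    """Embed texts batch by batch so only one batch is in flight at a time."""
    embeddings: List[Vector] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_redundant_data(texts: Sequence[str], embed_fn: EmbedFn,
                          threshold: float = 0.9,
                          batch_size: int = 2) -> List[str]:
    """Keep a text only if it is below threshold against every kept text."""
    embeddings = process_embeddings(texts, embed_fn, batch_size)
    kept_texts: List[str] = []
    kept_vecs: List[Vector] = []
    for text, vec in zip(texts, embeddings):
        if all(cosine_similarity(vec, kept) < threshold for kept in kept_vecs):
            kept_texts.append(text)
            kept_vecs.append(vec)
    return kept_texts


if __name__ == "__main__":
    chunks = ["Install the package", "install the package", "Run the crawler"]
    print(remove_redundant_data(chunks, toy_embed))
```

The greedy pass keeps the first occurrence of each near-duplicate group and compares every candidate against everything kept so far, which is quadratic in the number of chunks. That is probably acceptable for a single crawled site, but a larger corpus would likely need an approximate nearest-neighbour index instead.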

Future Improvements and Customization:

The module serves as a foundational framework for further customization and enhancement. Future iterations can explore:


This pull request is brought to you by: WELL IT WORKS ON MY MACHINE!

marcelovicentegc commented 10 months ago

Hey @Daethyra :wave:! It occurs to me that these changes could be introduced as a separate package in another repo, as something possibly complementary to the gpt-crawler package.

Some instructions are missing from this PR:

Daethyra commented 9 months ago

Hey @marcelovicentegc,

Sorry for the late reply. I appreciate you having a look at the PR.

In regard to introducing the Python script as a separate package, I leave that to you -- I'm a noob and don't know what's best for production environments and what's sustainable for you and your team.

To answer your questions,

Acknowledgements: