Curate HTML w/ Semantic similarity | JinaAI Embeddings v2 (Small) | Curate HTML to Markdown with JinaAI Embedding Processing for Redundancy Removal

Daethyra commented 10 months ago

This pull request introduces significant enhancements to the conv_html_to_markdown.py module, integrating JinaAI's embeddings v2 (Small) model for advanced text processing. The primary goal is to refine the conversion of HTML content into Markdown format, with an added focus on removing redundant data using semantic analysis. Below is a detailed overview of the workflow, reasoning, steps taken, and the key features of this updated module.

Workflow and Reasoning:

Initial Goal: Improve the conversion of HTML to Markdown by eliminating redundant or semantically similar content.
Choice of Technology: Utilized JinaAI's embeddings v2 (Small) model because it is open-source, handles large text inputs and semantic text analysis efficiently, and has an 8k maximum token context length.
Security Consideration: Implemented trust_remote_code=True cautiously to allow the execution of the model's custom code, acknowledging the potential risks involved. Passing this value is required for the model to run.
Thought Process: By integrating the conv_html_to_markdown.py module, we streamline the conversion process and intelligently curate the HTML content, paving the way for more refined and contextually relevant Markdown outputs.
- Semantic Similarity Threshold: I found most success with a float of 0.8699 after extensive methodical testing against the Hugging Face Pipelines documentation.
- Docker Integration: conv_html_to_markdown.py is seamlessly integrated into the root dir's Dockerfile, which has also been updated to install all necessary Python packages for the environment.
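
For reference, a minimal sketch of how the model could be loaded with trust_remote_code=True. The helper name load_embedding_model and the full repo id jinaai/jina-embeddings-v2-small-en are assumptions for illustration and may not match what conv_html_to_markdown.py actually does.

```python
# Assumed Hugging Face repo id for JinaAI's embeddings v2 (Small) model.
MODEL_ID = "jinaai/jina-embeddings-v2-small-en"


def load_embedding_model(model_id: str = MODEL_ID):
    """Hypothetical helper: return (tokenizer, model) for the embedding model."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # trust_remote_code=True runs Python code shipped in the model repository,
    # so it should only be enabled for repositories you have reviewed or trust.
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    model.eval()  # inference only, disables dropout
    return tokenizer, model
```

Calling load_embedding_model() downloads the weights on first use. Passing a pinned revision to from_pretrained is one way to limit exposure to later changes in the remote code.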

Features Covered:

HTML to Markdown conversion with tag stripping and link processing.
Semantic redundancy removal using JinaAI embeddings.
Batch processing of text for efficient embedding generation.
Resilient error handling with detailed logging.
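
As a rough illustration of the batch processing feature, the sketch below splits lines into fixed-size batches before they are embedded. The function name iter_batches and the default batch size of 32 are assumptions, not values taken from the module.

```python
from typing import Iterator, List, Sequence


def iter_batches(lines: Sequence[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield consecutive slices of `lines`, each holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(lines), batch_size):
        # The last batch may be shorter than batch_size.
        yield list(lines[start:start + batch_size])


if __name__ == "__main__":
    for batch in iter_batches(["a", "b", "c", "d", "e"], batch_size=2):
        print(batch)
```

Each batch can then be tokenized and passed to the model in one forward call, which keeps memory use bounded on large pages.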

Task List:

[x] Convert HTML content to Markdown format.
[x] Integrate JinaAI's embeddings for semantic analysis.
[x] Implement mean pooling and batch processing for embedding generation.
[x] Develop functionality to identify and remove redundant lines based on embeddings.
[x] Ensure robust error handling and logging throughout the module.
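
To make the mean pooling item concrete, here is a dependency-free sketch of the idea: token vectors are averaged while padding positions (attention mask 0) are ignored. The module presumably does this on torch tensors; plain lists and the name mean_pool are used here purely for illustration.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("token_embeddings and attention_mask must have equal length")
    dim = len(token_embeddings[0]) if token_embeddings else 0
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against division by zero when every position is padding.
    count = max(count, 1)
    return [t / count for t in total]


if __name__ == "__main__":
    # Two real tokens and one padding token.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
```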

Steps Taken:

Module Setup: Initialized the module with BeautifulSoup and markdownify for HTML parsing and Markdown conversion.
Embedding Integration: Integrated AutoTokenizer and AutoModel from the transformers library for embedding processing.
Batch Processing Logic: Developed process_embeddings method to handle embeddings in batches, optimizing for large datasets.
Semantic Analysis: Created remove_redundant_data method to filter out semantically similar content using cosine similarity on embeddings.
Error Handling: Added comprehensive try-except blocks for resilience, particularly in critical methods like convert and curate_content.
Testing and Validation: Conducted thorough testing to ensure the accuracy and efficiency of the embedding-based redundancy removal.
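
The semantic analysis step can be sketched roughly as follows, assuming each line already has an embedding vector. A line is dropped when its cosine similarity to any previously kept line reaches the threshold. The names remove_redundant_lines and cosine_similarity, the greedy keep-first strategy, and the use of >= for the comparison are assumptions; only the 0.8699 threshold comes from this PR.

```python
import math
from typing import List, Sequence

SIMILARITY_THRESHOLD = 0.8699  # threshold value reported in this PR


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def remove_redundant_lines(
    lines: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[str]:
    """Keep a line only if it is not too similar to any line kept before it."""
    if len(lines) != len(embeddings):
        raise ValueError("lines and embeddings must have equal length")
    kept_lines: List[str] = []
    kept_vectors: List[Sequence[float]] = []
    for line, vector in zip(lines, embeddings):
        if any(cosine_similarity(vector, kept) >= threshold for kept in kept_vectors):
            continue  # near-duplicate of an earlier line
        kept_lines.append(line)
        kept_vectors.append(vector)
    return kept_lines


if __name__ == "__main__":
    toy_lines = ["Install the package", "Install this package", "Run the tests"]
    toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # made-up embeddings
    print(remove_redundant_lines(toy_lines, toy_vectors))
```

This pairwise comparison is quadratic in the number of kept lines, which is fine for a single page but may need a smarter index for very large crawls.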

Future Improvements and Customization:

The module serves as a foundational framework for further customization and enhancement. Future iterations can explore:

Tuning of the semantic similarity threshold for redundancy removal.
- I found most success with a float of 0.8699 after extensive methodical testing against the Hugging Face Pipelines documentation.
Wrap scraped code blocks in Markdown fenced blocks (backticks).
- Create visual divider of chunks or sections
- Could implement pagination of sorts instead

This pull request is brought to you by: WELL IT WORKS ON MY MACHINE!

marcelovicentegc commented 10 months ago

Hey @Daethyra :wave:! It occurs to me that these changes could be introduced as a separate package in another repo, as something possibly complementary to the gpt-crawler package.

I'm missing some instructions on this PR:

What are the requirements to run these changes? Do people need a HF API key to run this?
How does it integrate with gpt-crawler?

Daethyra commented 9 months ago

Hey @marcelovicentegc,

Sorry for the late reply. I appreciate you having a look at the PR.

In regard to introducing the Python script as a separate package, I leave that to you -- I'm a noob and don't know what's best for production environments and what's sustainable for you and your team.

To answer your questions,

Currently, the conv_html_to_markdown.py module runs as soon as the HTML scraping is finished.
Requirements:
1. Python: ">=3.9, <=3.10.12" | Successfully tested on 3.10 & 3.11
2. Packages: pip install -U beautifulsoup4 markdownify transformers torch
3. API Key?: No external API keys required as jina-embeddings-v2-small-en is open access on HF. It's not like Llama 2 where you need to request access. It's ready to go!
4. Integration: As normal, from inside the root directory of gpt-crawler, build and run the image. The Python script has already been pointed to in the Dockerfile's build process.
PowerShell: run docker build -t gpt-crawler . ; docker run -it gpt-crawler
Bash: run docker build -t gpt-crawler . && docker run -it gpt-crawler

Acknowledgements:

More enhancements to conv_html_to_markdown.py may be found on my fork's version.
I haven't thought of a way to quickly test results from conv_html_to_markdown.py because I rebuild the image every time I update the module's logic.
Adding this Python module does not produce ideal results; however, they are still an improvement upon the JSONic data for Custom GPTs, in my opinion.
I intend to add the following features at a later date, once Assistant Architect receives its file-based enhancements from the Utilikit:
