Closed: Daethyra closed this pull request 9 months ago
Hey @Daethyra :wave:! It occurs to me that these changes could be introduced as a separate package in another repo, as something possibly complementary to the gpt-crawler package.
I'm missing some instructions on this PR:
Hey @marcelovicentegc,
Sorry for the late reply. I appreciate you having a look at the PR.
In regard to introducing the Python script as a separate package, I leave that to you -- I'm a noob and don't know what's best for production environments and what's sustainable for you and your team.
To answer your questions:

The `conv_html_to_markdown.py` module runs as soon as the HTML scraping is finished. Its Python dependencies are installed with:

```shell
pip install -U beautifulsoup4 markdownify transformers torch
```
To run it inside `gpt-crawler`, build and run the image; the Python script has already been pointed to in the Dockerfile's build process:

```shell
docker build -t gpt-crawler . && docker run -it gpt-crawler
```
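For reference, the Dockerfile wiring described above might look roughly like the fragment below. This is a hypothetical sketch, not the PR's actual Dockerfile; the base image, paths, and command layout are all assumptions.

```dockerfile
# Hypothetical sketch of how the Dockerfile could chain the crawler
# and the Python post-processing step; the real file may differ.
FROM node:18
WORKDIR /app
COPY . .
RUN npm install
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install -U beautifulsoup4 markdownify transformers torch
# Crawl first, then convert the scraped HTML to Markdown
CMD ["sh", "-c", "npm start && python3 conv_html_to_markdown.py"]
```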
Acknowledgements: the most up-to-date `conv_html_to_markdown.py` may be found on my fork's version of the repository, because I rebuild the image every time I update the module's logic.
This pull request introduces significant enhancements to the `conv_html_to_markdown.py` module, integrating JinaAI's embeddings v2 (Small) model for advanced text processing. The primary goal is to refine the conversion of HTML content into Markdown format, with an added focus on removing redundant data using semantic analysis. Below is a detailed overview of the workflow, reasoning, steps taken, and the key features of this updated module.

Workflow and Reasoning:
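To make the HTML-to-Markdown step concrete, here is a deliberately tiny, stdlib-only sketch of the idea. The real module relies on BeautifulSoup and markdownify; `MiniMarkdown` and `html_to_markdown` are illustrative names, not the PR's API.

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter, for illustration only.

    The PR's module uses BeautifulSoup + markdownify; this class just
    shows the shape of the transformation (headings and text blocks).
    """

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        # Map <h1>..<h3> to the matching number of '#' characters.
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(self.prefix + text)
            self.prefix = ""

    def to_markdown(self):
        return "\n\n".join(self.blocks)


def html_to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return parser.to_markdown()
```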
We pass `trust_remote_code=True` cautiously to allow the execution of the model's custom code, acknowledging the potential risks involved; passing this value is required for the model to run.
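As a concrete illustration, loading the model with Hugging Face transformers might look like the minimal sketch below. The helper name is hypothetical, and the `jinaai/jina-embeddings-v2-small-en` checkpoint id is an assumption based on the model named in this PR.

```python
def load_embedding_model(name: str = "jinaai/jina-embeddings-v2-small-en"):
    """Load the JinaAI embedding model (hypothetical helper, not the PR's API)."""
    from transformers import AutoModel  # requires: pip install transformers torch

    # trust_remote_code=True lets transformers execute the custom modeling
    # code that ships with the Jina checkpoint; it is required for this
    # model, but should only be enabled for repositories you trust.
    return AutoModel.from_pretrained(name, trust_remote_code=True)
```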
By integrating the model directly into the `conv_html_to_markdown.py` module, we streamline the conversion process and intelligently curate the HTML content, paving the way for more refined and contextually relevant Markdown outputs. `conv_html_to_markdown.py` is seamlessly integrated into the root directory's Dockerfile, which has also been updated to install all necessary Python packages for the environment.

Features Covered:
Task List:
Steps Taken:
- Added a `process_embeddings` method to handle embeddings in batches, optimizing for large datasets.
- Added a `remove_redundant_data` method to filter out semantically similar content using cosine similarity on embeddings.
- Updated `convert` and `curate_content` accordingly.
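The redundancy-removal step above can be sketched as follows. This is a minimal illustration of cosine-similarity de-duplication under assumed signatures; the PR's actual `remove_redundant_data` may batch and score differently, and the `threshold` value here is invented.

```python
import numpy as np


def remove_redundant_data(chunks, embeddings, threshold=0.9):
    """Drop chunks whose embedding is a near-duplicate of an earlier kept chunk.

    Hypothetical sketch: `chunks` is a list of text blocks and `embeddings`
    their corresponding vectors; a chunk is discarded when its cosine
    similarity to any already-kept chunk exceeds `threshold`.
    """
    vecs = np.asarray(embeddings, dtype=float)
    # Normalize rows so a plain dot product equals cosine similarity.
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    kept_indices, kept_vecs = [], []
    for i, vec in enumerate(vecs):
        if all(float(vec @ kept) < threshold for kept in kept_vecs):
            kept_indices.append(i)
            kept_vecs.append(vec)
    return [chunks[i] for i in kept_indices]
```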
Future Improvements and Customization:
The module serves as a foundational framework for further customization and enhancement. Future iterations can explore:
This pull request is brought to you by: WELL IT WORKS ON MY MACHINE!