BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License

[FR] Optimization of Data Formatting for Custom GPT #81

Open Snowzer91 opened 10 months ago

Snowzer91 commented 10 months ago

Context: The JSON knowledge file currently generated for a Custom GPT contains redundant data, particularly HTML content that is repeated across many pages.

Proposal: Add an optimization step that uses hashing to reduce the number of tokens in the HTML content. The step should identify the common parts of the HTML and also choose hash keys that are themselves as short as possible, so that the total token count goes down.

Technical Advantages:

  1. Space Optimization: Hash-based deduplication removes repeated content, reducing the number of tokens and the overall size of the JSON file.
  2. Storage Efficiency: Each common part is stored only once and referenced by its hash key wherever it appears.
  3. Lightweight Transmission: A smaller file takes less time to transfer.
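How much is saved depends on how long and how often the shared text is. The following is a rough sketch for estimating this, using the character count of the serialized JSON as a stand-in for tokens (an assumption, since real token counts depend on the tokenizer). The function `sizes`, the page data, and the key names `s0` and `u0`, `u1`, ... are hypothetical and not part of gpt-crawler.

```python
import json


def sizes(shared, n_pages=50):
    """Return (raw_chars, deduped_chars) for n_pages that all end with `shared`."""
    bodies = [f"Unique body text number {i}." for i in range(n_pages)]

    # Current layout: every page repeats the shared text inside "html".
    raw = [
        {"title": f"Article {i}", "url": f"https://example.com/article{i}",
         "html": f"{body} {shared}"}
        for i, body in enumerate(bodies)
    ]

    # Proposed layout: each distinct segment is stored once in a lookup table
    # and pages list only the keys.
    table = {"s0": shared}
    pages = []
    for i, body in enumerate(bodies):
        table[f"u{i}"] = body
        pages.append({"title": f"Article {i}",
                      "url": f"https://example.com/article{i}",
                      "html_hash": [f"u{i}", "s0"]})

    deduped = {"pages": pages, "table": table}
    return len(json.dumps(raw)), len(json.dumps(deduped))


long_raw, long_dedup = sizes(
    "Welcome to our site. Read our other articles on computer science.")
short_raw, short_dedup = sizes("Hi.")
print("long shared text: ", long_raw, "->", long_dedup)
print("short shared text:", short_raw, "->", short_dedup)
```

With the long shared sentence the deduplicated file comes out smaller. With the three-character one it comes out larger, because the keys and the table cost more than the text they replace. A minimum length or frequency threshold for what gets a key would probably be needed.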

Proposed Operation: Each common part is passed through a hashing function to produce a key that identifies it. The keys themselves should be kept short so that they cost as few tokens as possible. The keys and the text they stand for are stored in a lookup table, and the entries in the JSON database refer to these keys instead of repeating the original text (in the example below, only the HTML content is replaced).
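A minimal sketch of this operation, under several assumptions: the text is split into sentence-like segments with a simple regular expression, and short sequential keys (`s0`, `s1`, ...) stand in for the hash, since any short collision-free identifier serves the same purpose. A truncated `hashlib` digest could be used instead, but it would need a collision check. The names `split_segments` and `encode` are hypothetical.

```python
import re

PAGES = [
    {"title": "Article on Artificial Intelligence",
     "url": "https://example.com/article1",
     "html": "Artificial intelligence (AI) is a discipline of computer science "
             "revolutionizing many sectors. Welcome to our site."},
    {"title": "In-Depth Exploration of AI",
     "url": "https://example.com/article2",
     "html": "Artificial intelligence, also known as AI, is a discipline of "
             "computer science. Welcome to our site."},
]


def split_segments(text):
    """Split text into sentence-like segments after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def encode(pages):
    """Return (records, table); each distinct segment is stored once in table."""
    key_of = {}  # segment text -> short key, in first-seen order
    records = []
    for page in pages:
        keys = []
        for segment in split_segments(page["html"]):
            if segment not in key_of:
                key_of[segment] = f"s{len(key_of)}"
            keys.append(key_of[segment])
        records.append({"title": page["title"], "url": page["url"],
                        "html_hash": keys})
    table = {key: segment for segment, key in key_of.items()}
    return records, table


records, table = encode(PAGES)
print(records[0]["html_hash"])  # ['s0', 's1']
print(records[1]["html_hash"])  # ['s2', 's1']
print(table["s1"])              # Welcome to our site.
```

Sentence-level splitting only catches whole sentences that repeat, such as the shared "Welcome to our site." here. Finding shared fragments inside sentences needs a finer-grained comparison.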

Concrete Example: Consider two articles with similar HTML content containing common parts:

  1. Article on Artificial Intelligence:

    • Title: "Article on Artificial Intelligence"
    • URL: "https://example.com/article1"
    • HTML: "Artificial intelligence (AI) is a discipline of computer science revolutionizing many sectors. Welcome to our site."
  2. In-Depth Exploration of AI:

    • Title: "In-Depth Exploration of AI"
    • URL: "https://example.com/article2"
    • HTML: "Artificial intelligence, also known as AI, is a discipline of computer science. Welcome to our site."
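One possible way to find the parts these two articles share is Python's standard `difflib.SequenceMatcher`, applied to whitespace-separated words. This is only a sketch: the function `common_phrases` and the `min_words` threshold are assumptions, and a real crawl would have to compare many pages, not just two.

```python
from difflib import SequenceMatcher

ARTICLE_1 = ("Artificial intelligence (AI) is a discipline of computer science "
             "revolutionizing many sectors. Welcome to our site.")
ARTICLE_2 = ("Artificial intelligence, also known as AI, is a discipline of "
             "computer science. Welcome to our site.")


def common_phrases(a, b, min_words=2):
    """Return word runs of at least min_words that appear in both texts."""
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    phrases = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            phrases.append(" ".join(words_a[block.a:block.a + block.size]))
    return phrases


shared = common_phrases(ARTICLE_1, ARTICLE_2)
print(shared)  # ['is a discipline of computer', 'Welcome to our site.']
```

Because punctuation stays attached to the words, "intelligence" and "intelligence," do not match, and neither do "science" and "science.". Separating punctuation from words before comparing would recover longer shared runs such as "Artificial intelligence".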

Identifying the common parts of the HTML and replacing them with short hash keys would give a result like the following:


// Optimized database
[
  {
    "title": "Article on Artificial Intelligence",
    "url": "https://example.com/article1",
    "html_hash": ["a1b2c3", "d4e5f6", "g7h8i9", "j10k11l12", "m13n14o15", "p16q17r18"]
  },
  {
    "title": "In-Depth Exploration of AI",
    "url": "https://example.com/article2",
    "html_hash": ["a1b2c3", "s19t20u21", "g7h8i9", "j10k11l12", "v22w23x24"]
  }
]

// Table of hashed common phrases
{
  "a1b2c3": "Artificial intelligence",
  "d4e5f6": " (AI)",
  "g7h8i9": " is a discipline of computer science...",
  "j10k11l12": "...",
  "m13n14o15": "revolutionizing many sectors...",
  "p16q17r18": "...",
  "s19t20u21": "also known as AI...",
  "v22w23x24": "..."
}
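For the optimized database to be usable, the original text must be recoverable from the keys. The sketch below shows that lookup under the assumption that each table value carries its own spacing and punctuation, so that plain concatenation restores the text. The keys and the table contents here are hypothetical complete values chosen for illustration; they are not the elided values from the table above.

```python
# Hypothetical lookup table: every value includes its own leading space
# or punctuation, so segments can be concatenated directly.
TABLE = {
    "a1": "Artificial intelligence",
    "b2": " (AI)",
    "c3": " is a discipline of computer science",
    "d4": " revolutionizing many sectors.",
    "e5": " Welcome to our site.",
    "f6": ", also known as AI,",
    "g7": ".",
}

DATABASE = [
    {"title": "Article on Artificial Intelligence",
     "url": "https://example.com/article1",
     "html_hash": ["a1", "b2", "c3", "d4", "e5"]},
    {"title": "In-Depth Exploration of AI",
     "url": "https://example.com/article2",
     "html_hash": ["a1", "f6", "c3", "g7", "e5"]},
]


def decode(record, table):
    """Rebuild the original text; raises KeyError if a key is missing."""
    return "".join(table[key] for key in record["html_hash"])


for record in DATABASE:
    print(decode(record, TABLE))
```

The two printed lines match the HTML of the two example articles, and the segments `a1`, `c3`, and `e5` are stored once but used by both.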