dglazkov / polymath

MIT License

Update format #12

Closed jkomoros closed 1 year ago

jkomoros commented 1 year ago

The current format is an odd legacy format.

A proposed better one:

{
  version: 0,
  //In the future other embedding models might be supported
  embedding_model: 'text-embedding-ada-002',
  content: {
    //A chunk_id is any string unique within this index to address your content. It could be a post's slug, a URL, or a monotonically-increasing integer formatted as a string.
    <chunk_id>: {
      text: <text>,
      embedding: <embedding>,
      token_count: <token_count>,
      info: {
        url: <url>,
        //All of the following properties are optional
        image_url: <image_url>,
        title: <title>,
        description: <description>
      }
    }
  }
}
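
A minimal Python sketch of how a file in the proposed format might be assembled. The helper name `make_library`, the sample chunk, and the choice to store the embedding as a plain list of floats are assumptions made for illustration; the proposal above does not specify how `<embedding>` is encoded.

```python
import json


def make_library(chunks, embedding_model="text-embedding-ada-002"):
    """Build a dict in the proposed format.

    `chunks` is an iterable of (chunk_id, text, embedding, token_count, info)
    tuples. This helper is hypothetical and not part of polymath.
    """
    content = {}
    for chunk_id, text, embedding, token_count, info in chunks:
        # chunk_id must be unique within the index.
        if chunk_id in content:
            raise ValueError(f"duplicate chunk_id: {chunk_id}")
        # url is the only required property of info.
        if "url" not in info:
            raise ValueError(f"chunk {chunk_id} is missing info.url")
        content[chunk_id] = {
            "text": text,
            "embedding": embedding,  # assumed here to be a list of floats
            "token_count": token_count,
            "info": info,
        }
    return {"version": 0, "embedding_model": embedding_model, "content": content}


library = make_library([
    (
        "post-1",
        "Hello world",
        [0.1, 0.2, 0.3],
        2,
        {"url": "https://example.com/post-1", "title": "Hello"},
    ),
])
serialized = json.dumps(library, indent=2)
```

Since JSON allows neither comments nor unquoted keys, the schema above is best read as a description of the shape and not as a literal file.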
dglazkov commented 1 year ago

My posts can run quite long, so I usually split them into multiple chunks, which means url/image_url/title/description will repeat for every chunk in the file. Maybe if we just gzip it, it will be okay? I am worried about the bloat.
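
One rough way to check the gzip idea is to build a file where many chunks share the same metadata and compare sizes. The sample data below is made up, and the embeddings are deliberately left empty so the measurement isolates the repeated metadata; real embedding vectors would likely compress less well.

```python
import gzip
import json

# Metadata shared by every chunk of one long post (made-up values).
info = {
    "url": "https://example.com/a-very-long-post",
    "image_url": "https://example.com/a-very-long-post/cover.png",
    "title": "A very long post",
    "description": "A post long enough to be split into many chunks.",
}

# 200 chunks, each repeating the same info block.
content = {
    str(i): {"text": f"chunk {i} text", "embedding": [], "token_count": 3, "info": info}
    for i in range(200)
}

raw = json.dumps(
    {"version": 0, "embedding_model": "text-embedding-ada-002", "content": content}
).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```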

dglazkov commented 1 year ago

Actually, I think it might be good-er to have URLs attached to chunks. Let's do it

jkomoros commented 1 year ago

Ah hmmm, good point. And using dictionaries instead of tuples also adds bloat.

Maybe we just take what we have and add a version and embedding model name:

{
  version: 0,
  embedding_model: 'text-embedding-ada-002',
  embeddings: [
    (
      <text>,
      <embedding>,
      <tokens_length>,
      <issue_id>
    )
  ],
  issue_info: {
    <issue_id>: (
      <url>,
      <image_url>,
      <title>,
      <description>
    )
  }
}
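
A sketch of how a reader of this compact format might join each chunk back to its metadata. The function name `expand`, the sample values, and the use of `None` for missing optional fields are assumptions; since JSON has no tuples, they are shown as lists.

```python
library = {
    "version": 0,
    "embedding_model": "text-embedding-ada-002",
    "embeddings": [
        # (text, embedding, tokens_length, issue_id)
        ["First chunk of the post", [0.1, 0.2], 5, "post-1"],
        ["Second chunk of the post", [0.3, 0.4], 5, "post-1"],
    ],
    "issue_info": {
        # issue_id: (url, image_url, title, description)
        "post-1": ["https://example.com/post-1", None, "Post one", None],
    },
}


def expand(library):
    """Yield one dict per chunk with its issue metadata joined in."""
    for text, embedding, tokens_length, issue_id in library["embeddings"]:
        url, image_url, title, description = library["issue_info"][issue_id]
        yield {
            "text": text,
            "embedding": embedding,
            "tokens_length": tokens_length,
            "url": url,
            "image_url": image_url,
            "title": title,
            "description": description,
        }


chunks = list(expand(library))
```

The metadata for `post-1` is stored once even though two chunks refer to it, which is the saving this layout is after.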
jkomoros commented 1 year ago

All of the actions tracked in this issue are now done or covered in other issues.