Update format - GitHub Issues

jkomoros commented 1 year ago

The current format is an odd legacy format.

A proposed better one:

{
  version: 0,
  //In the future other embedding models might be supported
  embedding_model: 'text-embedding-ada-002',
  content: {
    //A chunk_id is any string unique within this index to address your content. It could be a post's slug, a URL, or a monotonically-increasing integer formatted as a string.
    <chunk_id>: {
      text: <text>,
      embedding: <embedding>,
      token_count: <token_count>,
      info: {
        url: <url>,
        //All of the following properties are optional
        image_url: <image_url>,
        title: <title>,
        description: <description>
      }
    }
  }
}
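
To make the proposed shape concrete, here is a minimal sketch of what a validate_library check could look like for it. This is an illustration only, not the project's implementation: the specific checks, the error messages, and the example values (including the URL and the shortened embedding) are assumptions.

```python
EXPECTED_VERSION = 0
REQUIRED_CHUNK_KEYS = ("text", "embedding", "token_count", "info")


def validate_library(library):
    """Raise ValueError if the library does not match the proposed shape."""
    if not isinstance(library, dict):
        raise ValueError("library must be a dict")
    if library.get("version") != EXPECTED_VERSION:
        raise ValueError(f"unsupported version: {library.get('version')!r}")
    if not isinstance(library.get("embedding_model"), str):
        raise ValueError("embedding_model must be a string")
    content = library.get("content")
    if not isinstance(content, dict):
        raise ValueError("content must be a dict keyed by chunk_id")
    for chunk_id, chunk in content.items():
        if not isinstance(chunk_id, str):
            raise ValueError(f"chunk_id {chunk_id!r} must be a string")
        if not isinstance(chunk, dict):
            raise ValueError(f"chunk {chunk_id!r} must be a dict")
        for key in REQUIRED_CHUNK_KEYS:
            if key not in chunk:
                raise ValueError(f"chunk {chunk_id!r} is missing {key!r}")
        # url is the only required info property; the rest are optional.
        if not isinstance(chunk["info"], dict) or "url" not in chunk["info"]:
            raise ValueError(f"chunk {chunk_id!r} info is missing 'url'")


# Hypothetical example data; a real embedding would be much longer.
example = {
    "version": 0,
    "embedding_model": "text-embedding-ada-002",
    "content": {
        "my-post-slug": {
            "text": "Some chunk of text.",
            "embedding": [0.1, 0.2, 0.3],
            "token_count": 5,
            "info": {"url": "https://example.com/my-post"},
        }
    },
}

validate_library(example)  # passes silently for a well-formed library
```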

jkomoros commented 1 year ago

[x] Write a validate_library(library) that ensures it is valid and raises an error if not.
[x] load_library should check whether the library is missing a version number or has version -1, and if so up-convert it.
[x] Create conversion tool to take a file in the previous format and save in the new one.
[ ] Update validate_library to check all of the items in content for legality and shape.
[x] Should the content field in library be renamed to chunks? (Now captured in #23)
[x] Update _convert_library_from_version_og to handle the case where multiple embedding rows all have the same issue_id.
[x] Rename 'get_issue' and related arguments to not call it an issue.
[x] Change get_similarities to return a thing that isn't a tuple (related to #14 and now captured in #18).
[x] Create a validator to check the contents of a file and verify it's correct.
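
The load_library item in the checklist could be sketched roughly as follows. This is a hedged illustration rather than the actual code: it assumes the library is stored as JSON (the thread does not say), and needs_upconversion, upgrade_if_needed, and the convert_legacy callback are hypothetical names introduced here.

```python
import json

LEGACY_VERSION = -1  # the "-1" marker mentioned in the checklist


def needs_upconversion(data):
    """Return True when the version is missing or equal to -1."""
    return data.get("version", LEGACY_VERSION) == LEGACY_VERSION


def upgrade_if_needed(data, convert_legacy):
    """Run the caller-supplied converter only on legacy data."""
    return convert_legacy(data) if needs_upconversion(data) else data


def load_library(path, convert_legacy):
    """Load a library file and up-convert it when it predates versioning.

    Assumes a JSON file; convert_legacy is a function that takes the
    legacy dict and returns one in the new format.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return upgrade_if_needed(data, convert_legacy)
```

Keeping the version check separate from the file I/O makes the up-conversion decision easy to test on plain dicts.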

dglazkov commented 1 year ago

My posts can run quite long, so I usually split them into multiple chunks, which means url/image_url/title/description will get repetitive in the file. Maybe if we just gzip it, it will be okay? I am worried about the bloat.
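
One rough way to check the gzip idea is to serialize a library with heavily repeated info and compare sizes. This is only a sketch: it assumes JSON serialization, uses made-up metadata, and leaves the embeddings empty, so it measures the repeated-metadata overhead only. Real embedding vectors will likely compress much less well.

```python
import gzip
import json


def serialized_sizes(library):
    """Return (raw_bytes, gzipped_bytes) for the JSON form of a library."""
    raw = json.dumps(library).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical metadata shared by every chunk of one long post.
shared_info = {
    "url": "https://example.com/a-long-post",
    "image_url": "https://example.com/a-long-post.png",
    "title": "A long post",
    "description": "A post long enough to be split into many chunks.",
}

library = {
    "version": 0,
    "embedding_model": "text-embedding-ada-002",
    "content": {
        str(i): {
            "text": f"chunk {i}",
            "embedding": [],  # left empty so only metadata is measured
            "token_count": 2,
            "info": shared_info,
        }
        for i in range(200)
    },
}

raw_size, gz_size = serialized_sizes(library)
print(f"raw: {raw_size} bytes, gzipped: {gz_size} bytes")
```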

dglazkov commented 1 year ago

Actually, I think it might be good-er to have URLs attached to chunks. Let's do it

jkomoros commented 1 year ago

Ah hmmm good point. And using dictionaries instead of tuples also adds bloat.

Maybe we just take what we have and add a version and embedding model name:

{
  version: 0,
  embedding_model: 'text-embedding-ada-002',
  embeddings: [
    (
      <text>,
      <embedding>,
      <tokens_length>,
      <issue_id>
    )
  ],
  issue_info: {
    <issue_id>: (
      <url>,
      <image_url>,
      <title>,
      <description>
    )
  }
}
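
To compare the two shapes, here is a sketch of mapping this tuple-based layout onto the chunk-keyed one proposed at the top of the thread. It is not the project's _convert_library_from_version_og: the function name convert_legacy, the "<issue_id>-<n>" chunk_id scheme for rows that share an issue_id, and the sample data are all assumptions made for illustration.

```python
def convert_legacy(legacy):
    """Convert the tuple-based layout into the chunk-keyed layout."""
    counts = {}  # how many chunks have been seen so far per issue_id
    content = {}
    for text, embedding, token_count, issue_id in legacy["embeddings"]:
        url, image_url, title, description = legacy["issue_info"][issue_id]
        index = counts.get(issue_id, 0)
        counts[issue_id] = index + 1
        # Assumed scheme: suffix a per-issue counter so that several rows
        # with the same issue_id still get unique chunk_ids.
        chunk_id = f"{issue_id}-{index}"
        content[chunk_id] = {
            "text": text,
            "embedding": embedding,
            "token_count": token_count,
            "info": {
                "url": url,
                "image_url": image_url,
                "title": title,
                "description": description,
            },
        }
    return {
        "version": 0,
        "embedding_model": legacy.get("embedding_model", "text-embedding-ada-002"),
        "content": content,
    }


# Hypothetical legacy data: two rows share the issue_id "post-a".
legacy = {
    "embeddings": [
        ("first half", [0.1], 2, "post-a"),
        ("second half", [0.2], 2, "post-a"),
        ("other post", [0.3], 2, "post-b"),
    ],
    "issue_info": {
        "post-a": ("https://example.com/a", None, "Post A", None),
        "post-b": ("https://example.com/b", None, "Post B", None),
    },
}

converted = convert_legacy(legacy)
print(sorted(converted["content"]))  # ['post-a-0', 'post-a-1', 'post-b-0']
```

Note that this duplicates the info on every chunk, which is exactly the repetition discussed above.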

jkomoros commented 1 year ago

All of the actions tracked in this issue are now done or covered in other issues.

dglazkov / polymath

Update format #12