Closed jkomoros closed 1 year ago
Check content for legality and shape.
Should the content field in the library be renamed to chunks? (Now captured in #23)
My posts can run quite long, so I usually split them into multiple chunks, which means url/image_url/title/description will get repetitive in the file. Maybe if we just gzip it, it will be okay? I am worried about the bloat.
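To get a rough feel for the bloat concern, here is a minimal Python sketch. The field values and the chunk count are made up for illustration, and the chunks carry no embedding vectors, so real ratios will differ. It repeats the per-post metadata on every chunk and compares the raw JSON size with the gzipped size.

```python
import gzip
import json

# Hypothetical post metadata, repeated on every chunk (values are made up).
post = {
    "url": "https://example.com/posts/1",
    "image_url": "https://example.com/posts/1.png",
    "title": "An example post",
    "description": "A long post that was split into many chunks.",
}

# 200 chunks from the same post, each carrying its own copy of the metadata.
chunks = [{"text": f"chunk {i}", **post} for i in range(200)]

raw = json.dumps(chunks).encode("utf-8")
compressed = gzip.compress(raw)

# Print both sizes in bytes so they can be compared.
print(len(raw), len(compressed))
```

Repeated strings like these are the easy case for gzip, so this only suggests that the metadata repetition itself may be tolerable once compressed; it says nothing about the uncompressed, in-memory size.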
Actually, I think it might be better to have URLs attached to chunks. Let's do it.
Ah, hmm, good point. Using dictionaries instead of tuples also leads to bloat.
Maybe we just take what we have and add a version and embedding model name:
{
  version: 0,
  embedding_model: 'text-embedding-ada-002',
  embeddings: [
    (
      <text>,
      <embedding>,
      <tokens_length>,
      <issue_id>
    )
  ],
  issue_info: {
    <issue_id>: (
      <url>,
      <image_url>,
      <title>,
      <description>
    )
  }
}
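As a rough illustration of how the proposed format above could be written and read, here is a minimal Python sketch. The sample values, the 3-element embedding, and the helper `chunk_with_info` are all hypothetical and not part of the library; only the field names and tuple order come from the proposal.

```python
import json

# Hypothetical sample data; the field order follows the proposal above.
library = {
    "version": 0,
    "embedding_model": "text-embedding-ada-002",
    "embeddings": [
        # (text, embedding, tokens_length, issue_id)
        ("First chunk of the post.", [0.1, 0.2, 0.3], 6, "issue-1"),
        ("Second chunk of the post.", [0.4, 0.5, 0.6], 6, "issue-1"),
    ],
    "issue_info": {
        # issue_id: (url, image_url, title, description)
        "issue-1": (
            "https://example.com/1",
            "https://example.com/1.png",
            "Example title",
            "Example description",
        ),
    },
}


def chunk_with_info(library, index):
    """Join one chunk with the shared info stored once for its issue."""
    text, _embedding, tokens_length, issue_id = library["embeddings"][index]
    url, image_url, title, description = library["issue_info"][issue_id]
    return {
        "text": text,
        "tokens_length": tokens_length,
        "url": url,
        "image_url": image_url,
        "title": title,
        "description": description,
    }


# JSON has no tuple type, so the tuples are written out as arrays.
serialized = json.dumps(library)
restored = json.loads(serialized)

print(chunk_with_info(restored, 1)["url"])
```

Storing the per-issue fields once under `issue_info` and referencing them by `issue_id` avoids repeating them on every chunk, at the cost of one extra lookup when reading a chunk.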
All of the actions tracked in this issue are now done or covered in other issues.
The current format is an odd legacy format.
A proposed better one: