Use a hash of the text for the chunk ID

jkomoros commented 1 year ago

In the library format, chunks must have a unique ID. Currently the ID is an arbitrary string chosen by the library author; by convention it looks like {article_slug}(_{article_chunk_index})?.

When we merge multiple libraries together, we have a TODO around "ensure the keys don't collide".

The main reason for the ID right now is so that you can incrementally re-run an import of inbound content, detect the chunks that are already there, and skip recalculating their embeddings (which is expensive). But as we tweak the chunker, the same text can end up at a different index within an article, so its ID changes and the lookup misses.

Finally, in #26 we're thinking about how to allow private content.

Something someone in Flux said the other day stuck with me: "Git was doing blockchain before crypto."

Git uses a hash of the content as the ID of each commit, so the ID uniquely references that content. Why not do the same for chunks?

Chunks would canonically have an ID that is the hash of their text content (Git uses SHA-1 by default, with SHA-256 as a newer option; we could use SHA-256). That gives a deterministic ID derived from the content, which helps with collision problems when merging libraries. It also means the same content always hashes to the same ID, so you can detect content you've already seen and just reuse its embeddings.
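A minimal sketch of the idea, assuming SHA-256 over the UTF-8 bytes of the chunk text. The names text_hash_id and embed_if_new, and the shape of the existing-embeddings dict, are hypothetical and not part of the library:

```python
import hashlib


def text_hash_id(text: str) -> str:
    """Return the SHA-256 hex digest of the chunk text (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_if_new(text, existing, embed):
    """Reuse the stored embedding when the same text was already imported.

    `existing` maps chunk ID -> embedding, and `embed` is whatever
    (expensive) embedding function the importer uses. Both are
    assumptions made for this sketch.
    """
    chunk_id = text_hash_id(text)
    if chunk_id not in existing:
        # Only pay for the embedding when the content is new.
        existing[chunk_id] = embed(text)
    return chunk_id, existing[chunk_id]
```

Because the ID depends only on the text, moving a chunk to a different index within an article no longer causes a cache miss.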

jkomoros commented 1 year ago

... Wait, I think this allows a lot of other things too. Text alone probably isn't sufficient to hash on, because you might find inane short chunks of content that show up in multiple libraries. But url + hash of the text probably is unique. And there's almost always a unique and obviously canonical URL for a chunk of content.
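One way to fold the URL into the ID, as a sketch only. The function name canonical_id matches the term used later in this thread, but the exact scheme here (SHA-256 over URL, a newline separator, then the text) is an assumption:

```python
import hashlib


def canonical_id(url: str, text: str) -> str:
    """Hash the URL and the chunk text together (hypothetical scheme)."""
    h = hashlib.sha256()
    h.update(url.encode("utf-8"))
    # Separator so that ("ab", "c") and ("a", "bc") hash differently;
    # this assumes URLs never contain a raw newline.
    h.update(b"\n")
    h.update(text.encode("utf-8"))
    return h.hexdigest()
```

With this, the same short chunk appearing at two different URLs gets two different IDs, while re-importing the same URL and text always yields the same ID.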

Tying the URL and the text chunk together into a little verified package seems... really useful. Like, it seems like it will allow interesting provenance tracking of chunks, and will probably help with some of the private content controls. I bet you can do clever things like "prove you have access to the content with a given hash by telling me the URL it comes from" or something, which seems like it would allow interesting federated actions without revealing information to people who don't have it yet.

jkomoros commented 1 year ago

I wonder if you should include anything else in the hash? Probably not the title/description/image_url where the chunk comes from, because those might change or be hard to deterministically extract for a given URL.

Probably not embedding and token_count, because those are so firmly tied to the embedding_model in use, and there will be multiple of those. But maybe there's some clever private information use case that is enabled if you hash the embedding, too? Something to think about...

dglazkov commented 1 year ago

timestamp, too?

jkomoros commented 1 year ago

What's the benefit of having the timestamp? Seems like another thing that's easy to change accidentally (e.g. an edit vs an update, or sources that have no inherent notion of timestamp so the chunks are just given a default timestamp of when the import is run).

dglazkov commented 1 year ago

I was thinking of somehow capturing "the time of export", but that's probably not super-useful.

jkomoros commented 1 year ago

[ ] All importer scripts output chunk IDs that are canonical_id
[ ] Version 1 of the library requires chunk_ids to be canonical_id, and upconverts old libraries at load
[ ] Library.validate() gets a deep=False parameter that, if true, checks things like whether every chunk_id is the canonical_id_for_chunk
[ ] Note that _get_context and Library.slice() might truncate the last chunk; if so the id might be different...
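A rough sketch of what the deep check could look like, and why truncation matters. This is not the actual Library.validate(); it is a standalone function over a plain dict of chunks, and both the url/text field layout and the hashing scheme are assumptions:

```python
import hashlib


def canonical_id_for_chunk(chunk: dict) -> str:
    """Hypothetical scheme: SHA-256 over the chunk's URL and text."""
    data = chunk["url"] + "\n" + chunk["text"]
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def validate(chunks: dict, deep: bool = False) -> None:
    """Raise ValueError if the chunk collection is malformed.

    The shallow pass only checks that required fields exist.
    deep=True also recomputes every ID from the chunk content.
    """
    for chunk_id, chunk in chunks.items():
        for field in ("url", "text"):
            if field not in chunk:
                raise ValueError(f"{chunk_id} is missing {field}")
        if deep and chunk_id != canonical_id_for_chunk(chunk):
            raise ValueError(f"{chunk_id} is not the canonical ID")
```

Under this sketch, a chunk whose text was truncated but which kept its original ID passes the shallow check and fails the deep one, so any code that truncates would need to either recompute the ID or be exempted.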

dglazkov / polymath

Use a hash of the text for the chunk ID #33