[ ] Is there a way to fetch multiple embeddings at once, so they aren't all fetched serially?
[x] Add a mode to incrementally add more content to the pkl, for example when an export has new chunks added. (Ideally it would also be resilient to content changes, for example by hashing the content to detect when an embedding needs to be recomputed.)
[x] Update tool to expect content in the canonical library format (but minus version, embedding_model, etc). (Update card-web export to export in that format, too)
[x] Update tool to be able to run in 'add missing embeddings' mode. It should be able to take in a pkl or json file, and also take in the output file (factoring out the rest of the path if it is in a different directory). Add a base argument, which is the file to base the output on. It defaults to the final output filename if not provided, and if it doesn't point to an existing file, the empty library is used as the base.
[ ] Give a percent complete for each embedding/token_count (either fetch the chunks first to figure out their length or move to an iterator style)
[ ] If the text has changed, redo token_count and embedding. And if the info differs, resave it.
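
A rough sketch of how the content-hash idea from the items above might work, to decide which chunks need a fresh token_count and embedding. The `content_hash` field name, the `chunks_needing_update` helper, and the ID-to-dict layout are assumptions made for illustration; they are not taken from the library format.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable SHA-256 hex digest of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_needing_update(base: dict, incoming: dict) -> list:
    """Return IDs of incoming chunks that are new or whose text has changed.

    Both arguments map chunk ID -> dict with a "text" key. Entries in `base`
    may also carry a "content_hash" key saved on a previous run
    (a hypothetical field, not part of the documented format).
    """
    stale = []
    for chunk_id, chunk in incoming.items():
        previous = base.get(chunk_id)
        new_hash = content_hash(chunk["text"])
        # Re-embed when the chunk is missing from the base or its hash differs.
        if previous is None or previous.get("content_hash") != new_hash:
            stale.append(chunk_id)
    return stale


base = {
    "a": {"text": "hello", "content_hash": content_hash("hello")},
    "b": {"text": "old text", "content_hash": content_hash("old text")},
}
incoming = {
    "a": {"text": "hello"},
    "b": {"text": "new text"},
    "c": {"text": "brand new"},
}
print(chunks_needing_update(base, incoming))  # ['b', 'c']
```

Only the IDs returned here would need their token_count and embedding recomputed; unchanged chunks could be copied over from the base file as-is.
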
To start, it should produce a .pkl file according to the format described in https://github.com/dglazkov/polymath/blob/main/format.md

It should accept a JSON document with a structure like: