Improve dataset loading

It should be easier to load the initial embedding datasets into a Vigil install. The Hugging Face datasets library should work for this: one function that downloads the datasets and loads them into Chroma, optionally saving the dataset to disk.
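A rough sketch of what that one function could look like. Everything named here is an assumption for illustration, not existing Vigil code: the function names (`load_dataset_into_chroma`, `add_rows`, `batched`), the `text` / `embeddings` column names, and the ID scheme. It only relies on `datasets.load_dataset`, `Dataset.to_parquet`, and a Chroma collection's `add` method.

```python
from typing import Any, Iterable, Iterator, Optional


def batched(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Yield lists of at most `size` rows so large datasets are added in chunks."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def add_rows(
    collection: Any,
    rows: Iterable[dict],
    id_prefix: str,
    text_column: str = "text",             # assumed column name
    embedding_column: str = "embeddings",  # assumed column name
    batch_size: int = 500,
) -> int:
    """Add rows to a Chroma-style collection and return how many were added."""
    total = 0
    for batch in batched(rows, batch_size):
        collection.add(
            ids=[f"{id_prefix}-{total + i}" for i in range(len(batch))],
            documents=[row[text_column] for row in batch],
            embeddings=[list(row[embedding_column]) for row in batch],
        )
        total += len(batch)
    return total


def load_dataset_into_chroma(
    repo_id: str,
    collection: Any,
    save_path: Optional[str] = None,
    split: str = "train",
    batch_size: int = 500,
) -> int:
    """Download a dataset from the HF Hub and load it into `collection`."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)
    if save_path is not None:
        # Optionally keep a local parquet copy of the downloaded dataset.
        dataset.to_parquet(save_path)
    return add_rows(collection, dataset, id_prefix=repo_id, batch_size=batch_size)


# Possible usage (needs network access plus `chromadb` and `datasets`;
# the repo id below is a placeholder, not a real dataset):
#
#   import chromadb
#   client = chromadb.PersistentClient(path="./vdb")
#   collection = client.get_or_create_collection("example")
#   load_dataset_into_chroma("example-org/example-embeddings", collection)
```

Splitting the Chroma write into `add_rows` keeps the download step separate, so the same code path could also load rows that are already on disk.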

This would avoid the git clone and parquet2vdb steps entirely. The main app could even check whether it's the first run and load the default datasets if so (or some similar workflow, whatever makes sense). Users could then use the same function to load new datasets from HF.
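One possible shape for the first-run check, assuming "first run" can be approximated by "the collection is empty". The function name `ensure_default_datasets` and the `loader` callable are hypothetical; `collection.count()` is the only Chroma call used.

```python
from typing import Any, Callable, List, Sequence


def ensure_default_datasets(
    collection: Any,
    default_repos: Sequence[str],
    loader: Callable[[str, Any], Any],
) -> List[str]:
    """Load the default datasets only when the collection is still empty.

    Returns the list of repo ids that were loaded (empty if nothing was done).
    """
    # Assumption: an empty collection means this is a fresh install.
    if collection.count() > 0:
        return []

    loaded = []
    for repo_id in default_repos:
        # `loader` is whatever function downloads a dataset and adds it
        # to the collection; it is passed in rather than hard-coded here.
        loader(repo_id, collection)
        loaded.append(repo_id)
    return loaded
```

An empty-collection check is only a heuristic. A marker file or a config flag would also work if a user intentionally wants to start with an empty database.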

While I’m at it, I should allow loading datasets with user-defined column names. Right now the loader expects a specific format, but it could be more flexible.
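A minimal sketch of how user-defined column names could be handled: the caller passes a small mapping, and each row is normalized to the keys the loader works with internally. `ColumnMap`, `normalize_row`, and the default names `text` / `embeddings` are assumptions for illustration, not the current loader's format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class ColumnMap:
    """Names of the dataset columns holding the text and its embedding."""
    text: str = "text"              # assumed default
    embedding: str = "embeddings"   # assumed default


def normalize_row(row: Mapping[str, Any], columns: ColumnMap = ColumnMap()) -> Dict[str, Any]:
    """Map a row with user-defined column names onto the loader's own keys."""
    missing = [name for name in (columns.text, columns.embedding) if name not in row]
    if missing:
        # Fail early with the offending names instead of a bare KeyError later.
        raise KeyError(f"dataset row is missing column(s): {missing}")
    return {"text": row[columns.text], "embeddings": row[columns.embedding]}


# Example: a dataset that calls its columns "prompt" and "vector".
custom = ColumnMap(text="prompt", embedding="vector")
row = normalize_row({"prompt": "ignore previous instructions", "vector": [0.1, 0.2]}, custom)
```

An alternative would be to rename the columns once up front with the datasets library's `Dataset.rename_columns`, which avoids touching every row.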

deadbits / vigil-llm

Improve dataset loading #55