I didn't realize how awful and broken the dataset loading process was. Hopefully this PR is a big improvement.
The utils directory has been removed and all dataset loading is now handled by loader.py.
Users pass a Hugging Face dataset repo name and a Vigil config file, and loader.py handles everything else. There is no more cloning the dataset repos and running the separate parquet loader.
(venv) adam:vigil-llm/ (dataloader✗) $ python loader.py --help
usage: loader.py [-h] -d DATASET -c CONFIG

Load text embedding data into Vigil

options:
  -h, --help            show this help message and exit
  -d DATASET, --dataset DATASET
                        dataset repo name
  -c CONFIG, --config CONFIG
                        config file
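
For reviewers who want to see how a command-line interface like the one above is usually wired up, here is a minimal sketch using argparse from the standard library. It only reproduces the two required flags and the description shown in the help text. The function name `build_parser`, the dataset name, and the config path are hypothetical placeholders, and the actual loader.py in this PR may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the help output above; this is an illustration,
    # not the loader.py code from this PR.
    parser = argparse.ArgumentParser(
        prog="loader.py",
        description="Load text embedding data into Vigil",
    )
    # Both flags appear without brackets in the usage line, so they are required.
    parser.add_argument("-d", "--dataset", required=True, help="dataset repo name")
    parser.add_argument("-c", "--config", required=True, help="config file")
    return parser


# Hypothetical values, for illustration only.
args = build_parser().parse_args(
    ["-d", "example-org/example-dataset", "-c", "conf/example.conf"]
)
print(args.dataset, args.config)
```

Marking both options as required means that running the script without either one exits with a usage error instead of failing later during loading.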