Currently the preprocessing scripts live in preprocessing/, but to fully reproduce the datasets we use, the process needs to be more consistent.
Ideally, this would be a Hugging Face dataset that handles everything from download and caching to preprocessing.