Closed danbraunai-apollo closed 5 months ago
Some other dataset related improvements that could be added (feel free to skip)
- [ ] Caching downloaded data (and models?) in some folder on the SSD. Best case, we can set transformers to offline mode and shave a second off imports (nice for fast tests!)
Yeah this would be nice. I haven't looked at "offline mode" before. Not going to look into it now, but made an issue for it.
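For whoever picks up that issue, a minimal sketch of what the caching plus offline-mode setup might look like. `HF_HOME`, `HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE` and `HF_DATASETS_OFFLINE` are environment variables documented by the Hugging Face libraries; the helper name `configure_hf_offline` and the cache path are hypothetical, and whether this actually saves import time in this repo is untested.

```python
import os
from pathlib import Path


def configure_hf_offline(cache_dir: str, offline: bool = True) -> dict:
    """Point the Hugging Face caches at cache_dir and optionally go offline.

    Hypothetical helper (not part of this repo). Call it before importing
    transformers/datasets, since the variables are read at import time.
    """
    settings = {"HF_HOME": str(Path(cache_dir).expanduser())}
    if offline:
        # Offline mode only works once data/models are already in the cache.
        settings.update(
            HF_HUB_OFFLINE="1",
            TRANSFORMERS_OFFLINE="1",
            HF_DATASETS_OFFLINE="1",
        )
    os.environ.update(settings)
    return settings
```

This could be called at the top of a `conftest.py` with a path on the local drive, after one online run with `offline=False` to populate the cache.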
Several changes since last review. Best for you to have a look. Most notably, as discussed, `test_stochastic_basis_tinystories` should be made more robust to different seeds and/or `n_samples` in the dataset config.
Add n_documents option for HF datasets
Description
How Has This Been Tested?
`test_data.test_invalid_hf_dataset_config` for checking the pydantic validation of the `HFDatasetConfig` `n_samples` and `seed` arguments.
Does this PR introduce a breaking change?
Yes.
`return_set_n_samples` has been renamed to `n_samples`. Also, previously, when `return_set_n_samples` was used in a pythia config, it actually specified the number of documents loaded from the dataset. It now specifies the number of `n_ctx`-length samples.
Note that: