ApolloResearch / rib

Library for methods related to the Local Interaction Basis (LIB)
MIT License

Add n_documents option for HF datasets #316

Closed danbraunai-apollo closed 5 months ago

danbraunai-apollo commented 6 months ago

Description

How Has This Been Tested?

Does this PR introduce a breaking change?

Yes. return_set_n_samples has been renamed to n_samples. In addition, when return_set_n_samples was previously used in a pythia config, it actually specified the number of documents loaded from the dataset. It now specifies the number of samples, each of length n_ctx.
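To illustrate the difference between counting documents and counting n_ctx-length samples, here is a minimal sketch. The helper chunk_documents is hypothetical and is not the code in this PR; it assumes documents are already tokenized, are concatenated into one token stream, and that any trailing partial chunk is dropped.

```python
from typing import Optional


def chunk_documents(
    tokenized_docs: list[list[int]],
    n_ctx: int,
    n_documents: Optional[int] = None,
    n_samples: Optional[int] = None,
) -> list[list[int]]:
    """Hypothetical helper (not rib's implementation) showing the two limits."""
    # n_documents limits how many raw documents are read from the dataset.
    docs = tokenized_docs if n_documents is None else tokenized_docs[:n_documents]
    # Assumption: tokens from all selected documents are concatenated.
    tokens = [tok for doc in docs for tok in doc]
    # Split into non-overlapping chunks of n_ctx tokens, dropping the remainder.
    n_full = len(tokens) // n_ctx
    samples = [tokens[i * n_ctx : (i + 1) * n_ctx] for i in range(n_full)]
    # n_samples limits how many n_ctx-length chunks are returned.
    return samples if n_samples is None else samples[:n_samples]


docs = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
by_docs = chunk_documents(docs, n_ctx=4, n_documents=2)  # two documents in
by_samples = chunk_documents(docs, n_ctx=4, n_samples=1)  # one sample out
```

With n_documents=2 the number of returned samples depends on how long the documents are, whereas n_samples=1 fixes the number of returned samples regardless of document length.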

Note that:

nix-apollo commented 6 months ago

Some other dataset-related improvements that could be added (feel free to skip):

danbraunai-apollo commented 6 months ago
  • [ ] Caching downloaded data (and models?) in a folder on the SSD. In the best case, we can set transformers to offline mode and shave a second off imports (nice for fast tests!)

Yeah, this would be nice. I haven't looked at "offline mode" before. I'm not going to look into it now, but I've made an issue for it.
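For reference, a rough sketch of what enabling offline mode might look like, using the environment variables documented by Hugging Face. The cache path is a placeholder, nothing here is part of this PR, and the variables need to be set before transformers or datasets is imported.

```python
import os

# Placeholder cache location (an assumption, not a path used by this repo).
os.environ["HF_HOME"] = "/path/to/ssd/hf_cache"

# Ask the Hugging Face libraries to use only locally cached files.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

# Import transformers / datasets only after this point so the settings apply.
```

Offline mode only works once the required models and datasets are already in the cache, so a first online run would still be needed.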

danbraunai-apollo commented 6 months ago

There have been several changes since the last review, so it's best for you to have another look. Most notably, as discussed, test_stochastic_basis_tinystories should be made more robust to different seeds and/or n_samples in the dataset config.