danbraunai-apollo commented 9 months ago

Support return_set_frac and n_samples for all datasets

NOTE: The commented-out test tests.test_build_graph.test_modular_arithmetic_rotate_final_layer_invariance must be addressed before merging. @stefan-apollo, would you mind looking at this? It doesn't seem to pass for most of the configurations.

Description

Creates a DatasetConfig base class that defines return_set_frac and return_set_n_samples arguments and validates them.
All dataset configs inherit from this class. (Note: an annoying mypy error forces these variables to be declared again in VisionDatasetConfig. When you call VisionDatasetConfig(), mypy doesn't seem to know that the return_set_frac and return_set_n_samples args are optional and expects them to be defined.)
Speeds up tests by using less data where possible
Creates rib.utils.get_data_subset for randomly splitting a dataset with a fraction or n_samples.
- This was previously unsupported on ModularArithmeticDataset
- Previously, VisionDataset had a return_set_frac that took samples from the start of the dataset. This change will therefore be a breaking change for configs that previously defined a return_set_frac.
- HFDataset remains the same, taking the return_set_frac from the start or end of the dataset (handled when loading from HF).
When creating a Modadd dataset from a loaded model, use cfg["dataset"]["seed"] instead of cfg["seed"] (#229).
For VisionDatasets and ModularArithmetic, use a random fraction or n_samples of the dataset instead of the first portion when specifying return_set_frac or return_set_n_samples. Note that HFDatasets will take the first portion with these args.
Force load_dataset to only return a Dataset, as opposed to Union[Dataset, tuple[Dataset, Dataset]]
- This gets rid of all the @overload definitions, yay.
- Change the type of return_set from Union[Literal["train", "test", "all"], Literal["both"]] to Optional[Literal["train", "test", "all"]]
- When we actually need both train and test (like in train_modular_arithmetic), just call load_dataset twice with a different return set. So nothing needs to be passed in the config for this argument.
- Removes the "both" option from all configs (breaking change)
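To make the random-subset behaviour described above concrete, here is a minimal sketch of the idea behind rib.utils.get_data_subset. It is not the implementation in this PR: the function name get_data_subset_sketch, its signature, the default seed, and the use of plain Python sequences instead of torch datasets are all assumptions made for illustration.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def get_data_subset_sketch(
    dataset: Sequence[T],
    frac: Optional[float] = None,
    n_samples: Optional[int] = None,
    seed: int = 0,
) -> list[T]:
    """Return a random subset of `dataset`, sized by `frac` or `n_samples`.

    Hypothetical stand-in for rib.utils.get_data_subset; the real signature may differ.
    """
    if frac is not None and n_samples is not None:
        raise ValueError("Specify at most one of frac and n_samples.")
    if frac is None and n_samples is None:
        # Neither argument given: return the full dataset unchanged.
        return list(dataset)
    if frac is not None:
        if not 0 < frac <= 1:
            raise ValueError("frac must be in the interval (0, 1].")
        n_samples = int(len(dataset) * frac)
    assert n_samples is not None
    if not 0 <= n_samples <= len(dataset):
        raise ValueError("n_samples must be between 0 and len(dataset).")
    # Sample indices without replacement, rather than taking the first n_samples items.
    rng = random.Random(seed)
    indices = rng.sample(range(len(dataset)), n_samples)
    return [dataset[i] for i in indices]
```

Seeding a local random.Random keeps the subset reproducible for a given seed without touching global random state, which is the property the cfg["dataset"]["seed"] change relies on.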

Note, it's possible that I can and should also define source, name and return_set inside the DatasetConfig base class, but I don't know how this works with pydantic when all the subclasses have different types and they must be defined in all subclasses.
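On the open question above, here is a minimal sketch of one way it might work. As far as I know, pydantic lets a subclass redeclare an inherited field with a narrower annotation, so the base class could hold the shared fields while each subclass narrows return_set. The class names with the Sketch suffix, the field defaults, and the particular Literal values per subclass are illustrative assumptions, not the classes in this PR, and the sketch does not check how mypy treats the redeclaration.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class DatasetConfigSketch(BaseModel):
    # Shared fields live on the base class (hypothetical stand-in for DatasetConfig).
    return_set: Optional[str] = None
    return_set_frac: Optional[float] = None
    return_set_n_samples: Optional[int] = None


class HFDatasetConfigSketch(DatasetConfigSketch):
    # Redeclare the inherited field with a narrower type; pydantic validates against it.
    return_set: Literal["train", "test"] = "train"


class ModularArithmeticDatasetConfigSketch(DatasetConfigSketch):
    # A different subclass can allow a different set of values for the same field.
    return_set: Literal["train", "test", "all"] = "train"


# Usage: "all" is accepted by one subclass and rejected by the other.
ModularArithmeticDatasetConfigSketch(return_set="all")
try:
    HFDatasetConfigSketch(return_set="all")
except ValidationError:
    pass  # rejected because "all" is not in this subclass's Literal
```

This would still require each subclass to restate return_set, so it only addresses where the shared fields live, not the duplication itself.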

Related Issue

Closes #234; Closes #229; Closes #238

How Has This Been Tested?

Added unit tests for get_data_subset in tests.test_loader.test_get_data_subset.

Does this PR introduce a breaking change?

Yes.

No longer supports passing "both" as a return_set argument.
Previously, VisionDataset had a return_set_frac that took samples from the start of the dataset. This change will therefore be a breaking change for configs that previously defined a return_set_frac. VisionDataset is a new thing, and Nix said that this isn't a concern.

danbraunai-apollo commented 9 months ago

TODO:

[x] Take the return_set_frac or return_set_n_samples from the final subset (e.g. train, test, or both). Currently it's taken before the data is split up, which isn't good.
[x] Write unit tests for _get_data_subset. Maybe even add a test that checks properties of the dataset when specifying variations of return_set_frac and return_set_n_samples in an lm_rib_build config.

danbraunai-apollo commented 9 months ago

> Generally these arguments are confusing to me, I'm not sure what is supported where.

I updated these a little bit:

    tokenizer_name: str = Field(
        ...,
        description="The HuggingFace name for the tokenizer. Please check whether the tokenizer is "
        "compatible with the model you are using.",
    )
    return_set: Literal["train", "test"] = Field(
        ..., description="The dataset split to return from HuggingFace."
    )

What do you think? I don't think we can or want to list a full set of tokenizers the users can load, if that's what you were pointing at?

nix-apollo commented 9 months ago

> Generally these arguments are confusing to me, I'm not sure what is supported where.

Sorry, I meant return_set and which dataset types supported "both", etc. But it's much clearer now!
