ApolloResearch / rib

Library for methods related to the Local Interaction Basis (LIB)
MIT License

Support return_set_frac and n_samples for all datasets #238

Closed. danbraunai-apollo closed this 9 months ago

danbraunai-apollo commented 9 months ago

Support return_set_frac and n_samples for all datasets

NOTE: The commented-out test tests.test_build_graph.test_modular_arithmetic_rotate_final_layer_invariance must be addressed before merging. @stefan-apollo, would you mind looking at this? It doesn't seem to pass for most of the configurations.

Description

Note: it's possible that I can, and should, also define source, name and return_set inside the DatasetConfig base class, but I don't know how this works with pydantic when the subclasses all use different types for these fields and each subclass must define them itself.
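For reference, a minimal sketch of how this could look, assuming pydantic's behaviour that a subclass may re-annotate an inherited field with a narrower type. The class names, field defaults, and the "tiny_stories" dataset name below are hypothetical and are not taken from this PR; only source, name, return_set, return_set_frac and n_samples come from the discussion above.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class DatasetConfig(BaseModel):
    """Hypothetical base class holding the fields every dataset config shares."""

    # Loosely typed here; each subclass narrows these annotations.
    source: str
    name: str
    return_set: str
    # Shared optional fields that need no per-subclass override.
    return_set_frac: Optional[float] = None
    n_samples: Optional[int] = None


class HFDatasetConfig(DatasetConfig):
    # Re-annotating a field in a subclass replaces the inherited annotation.
    source: Literal["huggingface"] = "huggingface"
    return_set: Literal["train", "test"]


class ModularArithmeticDatasetConfig(DatasetConfig):
    source: Literal["custom"] = "custom"
    name: Literal["modular_arithmetic"] = "modular_arithmetic"
    return_set: Literal["train", "test", "all"]


def is_valid(config_cls, **kwargs) -> bool:
    """Return True if config_cls accepts kwargs, False on a validation error."""
    try:
        config_cls(**kwargs)
        return True
    except ValidationError:
        return False
```

With this layout the base class documents the shared fields once, while each subclass still rejects values outside its own Literal. For example, `is_valid(HFDatasetConfig, name="tiny_stories", return_set="all")` is False, whereas the same return_set is accepted by the modular arithmetic config.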

Related Issue

Closes #234; Closes #229; Closes #238

How Has This Been Tested?

Does this PR introduce a breaking change?

Yes.

danbraunai-apollo commented 9 months ago

TODO:

danbraunai-apollo commented 9 months ago

> Generally these arguments are confusing to me, I'm not sure what is supported where.

I updated these a little bit:

    tokenizer_name: str = Field(
        ...,
        description="The HuggingFace name for the tokenizer. Please check whether the tokenizer is "
        "compatible with the model you are using.",
    )
    return_set: Literal["train", "test"] = Field(
        ..., description="The dataset split to return from HuggingFace."
    )

What do you think? I don't think we can, or would want to, list the full set of tokenizers that users can load, if that's what you were pointing at?

nix-apollo commented 9 months ago

> Generally these arguments are confusing to me, I'm not sure what is supported where.

Sorry, I meant return_set and which dataset types support both, etc. But it's much clearer now!