ApolloResearch / rib

Library for methods related to the Local Interaction Basis (LIB)
MIT License

Use separate (larger) dataset for gram (and mean) matrices #344

Closed stefan-apollo closed 4 months ago

stefan-apollo commented 4 months ago

Separate gram loader

Description

Allow the gram matrix computation to use a different dataset than the RIB basis computation.

I also allow using a pre-tokenized dataset rather than an untokenized one, to skip the (kinda slow) tokenization.

Also added an option to store the computed gram matrix to a file. That code doesn't feel super great and it'll need to be merged with #333, but it's there!

Motivation and Context

We noticed that the gram (PCA) matrices are a lot more sensitive to the number of samples, and also a lot cheaper to compute.

How Has This Been Tested?

Did runs and made scaling plots. Added a test making sure this config option runs.

Does this PR introduce a breaking change?

No. Not giving a gram_dataset defaults to using the same dataset as for the Cs.
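
A minimal sketch of that fallback, assuming a dataclass-style config. Only `gram_dataset` is named in this PR; `DatasetConfig`, `Config`, the `dataset` field, and `resolve_gram_dataset` are hypothetical names for illustration, not rib's actual classes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical dataset description (not rib's real config class)."""

    name: str
    n_samples: int


@dataclass(frozen=True)
class Config:
    """Hypothetical run config; only `gram_dataset` is named in this PR."""

    dataset: DatasetConfig  # dataset used for the Cs
    gram_dataset: Optional[DatasetConfig] = None  # optional, larger dataset


def resolve_gram_dataset(config: Config) -> DatasetConfig:
    """Return the dataset for the gram matrices, falling back to the Cs dataset."""
    if config.gram_dataset is None:
        return config.dataset
    return config.gram_dataset


small = DatasetConfig(name="small", n_samples=1_000)
large = DatasetConfig(name="large", n_samples=50_000)

old_style = Config(dataset=small)  # existing configs keep working
new_style = Config(dataset=small, gram_dataset=large)
```

With this shape, configs written before the change resolve to the same dataset as before, which is why the option would not be a breaking change.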

stefan-apollo commented 4 months ago

I notice load_interaction_rotations and load_mean_vectors_and_gram_matrices could probably be merged into one function somehow, but that's too much for me right now.

stefan-apollo commented 4 months ago

It would be nice to assert that all the configs match

stefan-apollo commented 4 months ago

Todo: Test that the verification will not give warnings now when I run things properly

Edit: Especially when I run gradient flow

stefan-apollo commented 4 months ago

Existing issues with verify function:

  1. I currently print a warning rather than raising a ValueError, because most of our existing files would break otherwise
  2. The list of attributes is a horrible hard-code, and it should be really easy to just programmatically get a list. (Fixed!)
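
One way to get the attribute list programmatically while keeping the warning-versus-error choice explicit, sketched under assumptions: `RunConfig`, `mismatched_fields`, `verify_configs`, and the `strict` flag are hypothetical and not the verify function in this PR.

```python
import warnings
from dataclasses import dataclass, fields

_MISSING = object()  # sentinel for attributes absent from older files


@dataclass
class RunConfig:
    """Hypothetical config stored alongside a results file."""

    model_name: str
    dtype: str
    seed: int


def mismatched_fields(expected, loaded):
    """List field names whose values differ, discovered via dataclasses.fields."""
    return [
        f.name
        for f in fields(expected)
        if getattr(loaded, f.name, _MISSING) != getattr(expected, f.name)
    ]


def verify_configs(expected, loaded, strict=False):
    """Warn (or raise ValueError when strict) if the two configs disagree."""
    diff = mismatched_fields(expected, loaded)
    if diff:
        msg = f"Config mismatch in fields: {diff}"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)
    return diff
```

Defaulting to a warning keeps older files loadable, and `strict=True` could be switched on once those files have been regenerated.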