TransformerLensOrg / TransformerLens

A library for mechanistic interpretability of GPT-style language models
https://transformerlensorg.github.io/TransformerLens/

NanoGPT conversion did not handle the case where there were no biases in the model #629

Open dashstander opened 4 weeks ago

dashstander commented 4 weeks ago

Description

convert_nanogpt_weights had two issues:

  1. It lacked the attention mask and the IGNORE tensor.
  2. It did not correctly handle the case where the nanogpt model was configured without biases in the linear layers. Loading the converted weights into a HookedTransformer would then fail because the expected tensors were missing. If the masking tensor is not supposed to be checkpointed, then there is a separate issue: HookedTransformer won't load a checkpoint that lacks it. (A sketch of the missing tensors follows below.)

I have not added any tests or rewritten documentation. There are no existing tests, and the only documentation I could find pertaining to this issue is a comment saying the code worked both with and without biases.
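For readers hitting the same load failure, here is a minimal sketch of the tensors a HookedTransformer state dict expects but a bias-free nanogpt checkpoint does not provide. This is an illustration, not the actual patch: the helper name is made up, and it assumes TransformerLens's usual state-dict key names and a HookedTransformerConfig-style `cfg` with `n_layers`, `n_heads`, `d_head`, `d_model`, `d_mlp`, and `n_ctx` set.

```python
import torch


def fill_missing_tensors(state_dict: dict, cfg) -> dict:
    """Hypothetical helper: add tensors HookedTransformer.load_state_dict
    expects but a bias-free nanogpt checkpoint never contains."""
    for layer in range(cfg.n_layers):
        # If the nanogpt model was trained with bias=False, supply zero
        # biases so the converted state dict is complete.
        zero_biases = {
            f"blocks.{layer}.attn.b_Q": torch.zeros(cfg.n_heads, cfg.d_head),
            f"blocks.{layer}.attn.b_K": torch.zeros(cfg.n_heads, cfg.d_head),
            f"blocks.{layer}.attn.b_V": torch.zeros(cfg.n_heads, cfg.d_head),
            f"blocks.{layer}.attn.b_O": torch.zeros(cfg.d_model),
            f"blocks.{layer}.mlp.b_in": torch.zeros(cfg.d_mlp),
            f"blocks.{layer}.mlp.b_out": torch.zeros(cfg.d_model),
        }
        for key, value in zero_biases.items():
            state_dict.setdefault(key, value)
        # The causal mask and IGNORE value are registered buffers, not
        # learned weights, so no upstream checkpoint carries them.
        state_dict[f"blocks.{layer}.attn.mask"] = torch.tril(
            torch.ones(cfg.n_ctx, cfg.n_ctx)
        ).bool()
        # Recent TransformerLens versions use -inf here; older ones used
        # a large negative float.
        state_dict[f"blocks.{layer}.attn.IGNORE"] = torch.tensor(float("-inf"))
    return state_dict
```

Filling in zero biases is safe because a linear layer with a zero bias is mathematically identical to one built with bias=False.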


bryce13950 commented 3 weeks ago

Thanks for finding this, and fixing it for everyone. There is just one type error that needs to be resolved when creating your tensor for mlp.b_in in your else block. The variable it is complaining about, d_mlp, could be None here, so there needs to either be an error thrown or a default set if it is None at that point. I don't mind which. In all likelihood it will be set by this point, but we need to account for the possibility that it is not.
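For concreteness, a sketch of the error-throwing option. The helper and its names are hypothetical; the real fix belongs inside the else block of the conversion code being reviewed.

```python
import torch


def mlp_b_in_for_layer(layer: int, cfg) -> torch.Tensor:
    # Guard suggested in review: fail loudly instead of passing None
    # to torch.zeros when d_mlp was never configured.
    if cfg.d_mlp is None:
        raise ValueError(
            f"cfg.d_mlp must be set to create blocks.{layer}.mlp.b_in"
        )
    return torch.zeros(cfg.d_mlp)
```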