Closed: jbloomAus closed this issue 1 year ago.
Some caveats: BERT is encoder-only, not encoder-decoder. It should be a HookedBERT or HookedEncoder class, in my opinion.
Note that post-LayerNorm means "LayerNorm applied after the layer". Confusingly, the LayerNormPre class means "LayerNorm with weights folded into the subsequent layer". Pre-LayerNorm when used in papers means LayerNorm at the start of a layer. shrug Naming conventions are hard man
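For concreteness, a rough PyTorch sketch of the pre-LN vs post-LN distinction (the class names are illustrative, not TransformerLens code):

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LayerNorm (original Transformer / BERT): LN applied after the residual add."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.ln(x + self.sublayer(x))  # LayerNorm applied after the layer

class PreLNBlock(nn.Module):
    """Pre-LayerNorm (GPT-2 style): LN applied at the start of the layer."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.ln(x))  # LayerNorm applied to the layer's input
```

LayerNormPre, by contrast, isn't about position at all: it's the TransformerLens convention for a LayerNorm whose scale and bias have been folded into the following weight matrix, leaving only the centring and normalisation in the module.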
On Wed, 26 Apr 2023 at 11:01, Joseph Bloom wrote:
Proposal
A glaring gap in models supported by TransformerLens is the lack of support for encoder/decoder models like BERT and T5. In order to rectify this, we need to build support for encoder-decoder models, cross attention and likely a few other details.
Motivation
- I've been asked a number of times by people whether we have T5 or BERT support coming.
- I think this is a reasonable contribution for people skilling up. You can ask for help or collaborate with other people working on building it.
- I (Joseph) and likely others will be able to give feedback on your work and pitch in if needed!
Pitch
In order to complete this card we are going to build a class called HookedEncoderDecoder which can be instantiated and have the BERT or T5 weights from huggingface https://huggingface.co/bert-base-uncased loaded in. This is the same strategy the package uses for other models, so there's nothing fundamentally new, only components we haven't yet built with hooks. Once we have a class with the corresponding weights loaded, we will want to ensure all components are "hooked" correctly, enabling methods like .run_with_cache to work. It's not as complicated as it sounds to get it working.
Possibly the "best" version of this is tricky, but we're willing to tolerate some code duplication as long as we're confident the implementation is accurate (i.e. tests show the hooked activations are correct and inference is identical to what we get without using TransformerLens).
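To make the goal concrete, here is a sketch of the intended usage, mirroring how HookedTransformer works today. The class name and exact call signature are assumptions about the API this issue proposes, not existing code:

```python
from transformer_lens import HookedEncoder  # hypothetical class proposed in this issue

# Load the huggingface weights into the hooked architecture
model = HookedEncoder.from_pretrained("bert-base-uncased")

# Run the model and capture every internal activation, as with HookedTransformer
logits, cache = model.run_with_cache("The cat sat on the [MASK].")

# Activations should be addressable by hook name
print(cache["blocks.0.attn.hook_pattern"].shape)
```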
Tasks:
- Make a fork for BERT support (keep it synced if building this takes more than a moment)
- Write some tests to be used during development; these should cover any novel architecture components involved. I'd start with two placeholder tests (see the sketch after this list): 1. can you build a model architecture that you can load the weights into, and 2. given that architecture, does it produce identical results to the huggingface model.
- Load the BERT model and look at each component. Maybe read the BERT paper if needed to get context. Make a list of every nn.Module that we don't currently support. Some items that will be on the list: 1. a linear block before the unembed, 2. post-LayerNorm (can't be folded, RIP), 3. sinusoidal positional embeddings. T5 might have other stuff.
- Keep in mind you might need a new config class to configure the entire architecture. If so, build it.
- If you can pass the test for the model doing inference correctly and show that you can use .run_with_cache and all the other methods that work on HookedTransformer (within reason), then we're likely ready to merge.
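For the testing task above, a minimal pytest-style sketch of the two placeholder tests, assuming a HookedEncoder class with a HookedTransformer-style from_pretrained. The class name and forward signature are assumptions about the API under development:

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

MODEL_NAME = "bert-base-uncased"

def test_architecture_loads_pretrained_weights():
    # Placeholder test 1: the architecture exists and the HF weights load into it.
    from transformer_lens import HookedEncoder  # assumed class under development
    model = HookedEncoder.from_pretrained(MODEL_NAME)
    assert model is not None

def test_logits_match_huggingface():
    # Placeholder test 2: on the same input, outputs match the HF reference model.
    from transformer_lens import HookedEncoder
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    hf_model = BertForMaskedLM.from_pretrained(MODEL_NAME).eval()
    hooked_model = HookedEncoder.from_pretrained(MODEL_NAME)

    tokens = tokenizer("Hello [MASK]!", return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hf_logits = hf_model(tokens).logits
        hooked_logits = hooked_model(tokens)  # assumed to return logits directly

    torch.testing.assert_close(hooked_logits, hf_logits, atol=1e-4, rtol=1e-4)
```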
Some things to note:
- You should refine the to do list and add detail as you go.
- Writing tests should make your life easier and create feedback cycles. It will also make it easier if multiple people want to collaborate on this.
- A previous PR (https://github.com/neelnanda-io/TransformerLens/pull/82) attempted to provide BERT support. Unfortunately, it drifted too far from main to merge. To avoid this, I recommend either syncing regularly or getting someone else to finish the PR if you can't.
- Possibly quite useful: a link to the original ARENA materials where we did this (without the hooks) during the bootcamp: https://github.com/callummcdougall/arena-v1/blob/main/w2d2/solutions_build_bert.py
Additional context
- BERT paper https://arxiv.org/abs/1810.04805
- BERT on huggingface https://huggingface.co/bert-base-uncased
- Related T5 Model issue https://github.com/neelnanda-io/TransformerLens/issues/94
Checklist
- I have checked that there is no similar issue in the repo (required)
I'd like to work on this! I'll start looking into it tomorrow.
@rusheb nice! It would be a good idea to create a simple PR first that splits the components from https://github.com/neelnanda-io/TransformerLens/blob/main/transformer_lens/components.py into different files
Why? To me that seems worse - having all components in the same place, in a file where you can easily Ctrl+F and jump to definition, seems great. And adding some BERTBlocks or CrossAttention components to that file seems totally fine to me.
If you are developing and end up with the same files open multiple times, this feels like a tell to me, but if that's not happening it's probably fine. Seems like personal preference?
Haha, uh oh, I just spent a couple of hours on this.
I see both sides of the argument. I think it's pythonic to have all the components in one file, and I'm also having some issues with getting the imports right when moving them to multiple files. OTOH it is a super long file, and splitting the components into separate files might make it easier to factor out duplication.
In my implementation I was planning to put all components in a components module so they would still be easy to find/search through. And obviously IDEs make searching pretty easy in either scenario.
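For what it's worth, a hypothetical layout for that components module (the file names are made up; the class names are ones that exist in components.py today). Re-exporting everything from the package __init__ would keep existing imports working unchanged:

```python
# transformer_lens/components/__init__.py  (hypothetical layout)
# Each component lives in its own file; re-export them here so existing
# imports such as `from transformer_lens.components import Attention`
# keep working unchanged.
from .embed import Embed, Unembed
from .layer_norm import LayerNorm, LayerNormPre
from .attention import Attention
from .mlp import MLP
from .transformer_block import TransformerBlock
```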
I'll probably pause on this until we get to a resolution. Branch here if anyone wants to look: https://github.com/rusheb/TransformerLens/tree/rusheb-components
I think it's minor. If Neel doesn't want to move them then maybe best to just keep everything in the one file and focus on the important parts of this ticket?
Why? To me that seems worse - having all components in the same place, in a file where you can easily Ctrl+F and jump to definition, seems great. And adding some BERTBlocks or CrossAttention components to that file seems totally fine to me.
Yeah I get it's not crazy yet, although 800 lines is quite a bit for newcomers (as you have to skim a bit to understand what is there). It was really just a suggestion based on the direction of travel with the repo - as the methods start getting split up for unit tests, it's going to get into the thousands.
Anyway fine for leaving as is and sorry about the extra work @rusheb!
For what it's worth if we keep the split we can probably just drop in all these tests - https://github.com/skyhookadventure/transformer-from-scratch/tree/main/transformer_from_scratch/components/tests
Based on various offline discussions, the MVP of this will be BERT for Masked Language Modelling -- basically the same architecture as huggingface BertForMaskedLM: the core BERT module plus a language modelling head.
Other tasks/outputs (e.g. embeddings, next sentence prediction, causal language modelling) are considered out of scope for the first iteration.
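As a reference point for what the MVP needs to reproduce, the huggingface baseline looks roughly like this (standard transformers usage, nothing TransformerLens-specific):

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # [batch, seq, vocab]

# Predict the token at the [MASK] position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # typically "paris"
```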
Hi, are there any updates? I do hope T5 can be supported. Thanks!
BERT support was added, but T5 was not, I believe (see HookedEncoder). We'd welcome a PR adding T5 if you're interested! I'm not sure there's the capacity to add it ourselves. What's the use case?
@neelnanda-io Thank you! I am interested and hope to squeeze some time trying that (but cannot guarantee).
What's the use case?
You know, many tasks such as machine translation use an encoder-decoder architecture instead of decoder-only.
Makes sense! Note that the nnsight library works out of the box for any PyTorch model. It's accordingly more lightweight than TransformerLens, but likely still useful.
I see. Thank you!