Plot training history of hidden representations in GPT

How quickly do different layers of GPT learn what they learn? Fix a test input and a point in training time (a checkpoint). At each layer, the network computes an activation vector, which is then added onto that layer's input via the skip connection. Now fix a layer and vary the checkpoint. Plot the path that this vector takes over training, for example as a heatmap of the dot products between every pair of snapshots. Talk to Gurkenglas if you want the longer, rambling version. This project should take a master developer about as much time as it took me to write this issue.
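
A minimal sketch of the heatmap idea, assuming the activation vectors for one layer and one test input have already been extracted from each checkpoint and stacked into an array of shape `(T, d)`. The function names `snapshot_gram` and `plot_gram`, the output file name, and the random-walk data are all hypothetical placeholders; nothing here loads a real GPT checkpoint.

```python
import numpy as np


def snapshot_gram(snapshots, normalize=False):
    """Return the T x T matrix of dot products between snapshots.

    snapshots: array of shape (T, d), one activation vector per checkpoint.
    normalize: if True, scale each vector to unit length first, so the
               entries become cosine similarities instead of raw dot products.
    """
    x = np.asarray(snapshots, dtype=np.float64)
    if normalize:
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        x = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return x @ x.T


def plot_gram(gram, path="layer_history.png"):
    """Save the snapshot-by-snapshot matrix as a heatmap image."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    im = ax.imshow(gram, origin="lower", cmap="viridis")
    ax.set_xlabel("checkpoint index")
    ax.set_ylabel("checkpoint index")
    fig.colorbar(im, ax=ax, label="dot product")
    fig.savefig(path)
    plt.close(fig)


# Placeholder data (an assumption, NOT real activations): a random walk in
# 64 dimensions stands in for one layer's vector drifting over 20 checkpoints.
rng = np.random.default_rng(0)
fake_snapshots = np.cumsum(rng.normal(size=(20, 64)), axis=0)

gram = snapshot_gram(fake_snapshots)
cosine = snapshot_gram(fake_snapshots, normalize=True)
```

Calling `plot_gram(gram)` writes the heatmap to disk. Repeating this per layer would give one heatmap per layer to compare side by side. Whether raw dot products or the normalized variant is more informative is left open here.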
