chengchingwen / Transformers.jl

Julia Implementation of Transformer models
MIT License

Loading old Transformer model built with Transformers@0.1.15 in Transformers@0.1.25 #125

Closed MNLubov closed 1 year ago

MNLubov commented 1 year ago

I am currently trying to load an old Transformer model, built with Flux@0.12.10 and Transformers@0.1.15, in the new Transformers@0.1.25. I tried the different approaches proposed in https://discourse.julialang.org/t/how-to-load-bson-file-of-the-model-build-with-flux-0-12-10-to-use-with-flux-0-13-flux-diagonal-deprecated-problem/91588 However, I found that the old version of the Transformer model differs structurally from the new version.
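For context, the failing load attempt is roughly the following (the file name is a placeholder, not the actual path):

```julia
# Minimal sketch of the load attempt; "old_model.bson" is illustrative.
using BSON, Flux, Transformers

# The BSON file stores the fully-typed model, so deserialization tries to
# reconstruct the old type parameters under the new package version.
BSON.@load "old_model.bson" model
```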

For the Transformers@0.1.25 and Flux@0.13.10 model

typeof(model) = TransformerModel{CompositeEmbedding{Float32, NamedTuple{(:tok, :pe, :segment),   Tuple{Embed{Float32, Matrix{Float32}}, PositionEmbedding{Float32, Matrix{Float32}}, Embed{Float32,   Matrix{Float32}}}}, NamedTuple{(:tok, :pe, :segment), Tuple{typeof(+), typeof(+), typeof(+)}},  
 Positionwise{Tuple{LayerNorm{typeof(identity), Flux.Scale{typeof(identity), Vector{Float32}, Vector{Float32}}, Float32, 1}, Dropout{Float64, Colon, Random.TaskLocalRNG}}}}, Gpt{Stack{Symbol("x':x => 4"), NTuple{4, Transformer{Transformers.Basic.MultiheadAttention{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dropout{Float64, Colon, Random.TaskLocalRNG}}, LayerNorm{typeof(identity), Flux.Scale{typeof(identity), Vector{Float32}, Vector{Float32}}, Float32, 1}, Transformers.Basic.PwFFN{Dense{typeof(gelu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, LayerNorm{typeof(identity), Flux.Scale{typeof(identity), Vector{Float32}, 
Vector{Float32}}, Float32, 1}, Dropout{Float64, Colon, Random.TaskLocalRNG}}}}, Dropout{Float64, Colon, Random.TaskLocalRNG}}, NamedTuple{(:pooler, :clf), Tuple{Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Chain{Tuple{Dropout{Float64, Colon, Random.TaskLocalRNG}, Dense{typeof(identity), Matrix{Float32}, 
Vector{Float32}}}}}}}

For the Transformers@0.1.15 model

typeof(model) = TransformerModel{CompositeEmbedding{Float32, NamedTuple{(:tok, :pe, :segment), Tuple{Embed{Float32, Matrix{Float32}}, PositionEmbedding{Float32, Matrix{Float32}}, Embed{Float32, Matrix{Float32}}}}, NamedTuple{(:tok, :pe, :segment), Tuple{typeof(+), typeof(+), typeof(+)}}, Positionwise{Tuple{LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Dropout{Float64, Colon}}}}, Gpt{Stack{NTuple{4, Transformer{Transformers.Basic.MultiheadAttention{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dropout{Float64, Colon}}, LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Transformers.Basic.PwFFN{Dense{typeof(gelu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Dropout{Float64, Colon}}}, Symbol("x':x => 4")}, Dropout{Float64, Colon}}, NamedTuple{(:pooler, :clf), Tuple{Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Chain{Tuple{Dropout{Float64, Colon}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}}}

The main differences are the replacement of the Flux.Diagonal layers with Flux.Scale, which is not a big problem in itself. The real problem is in the Gpt's Stack layer: in the latest Transformers.jl (@0.1.25) it is

Gpt{Stack{Symbol("x':x => 4"), NTuple{4, Transformer{Transformers.Basic.MultiheadAttention{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}

while in the previous version, Transformers@0.1.15, it was

Gpt{Stack{NTuple{4, Transformer{Transformers.Basic.MultiheadAttention{Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}

Due to this difference in the Stack type parameters, i.e. Gpt{Stack{Symbol("x':x => 4"), NTuple{4,... vs. Gpt{Stack{NTuple{4,..., the following error occurs when loading:

TypeError: in Stack, in T, expected T<:Tuple, got a value of type Symbol
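The error is consistent with the two Stack type parameters having swapped order between the versions. This is my reading of the printed types above, not the package's documented definition:

```julia
# Inferred from the typeof output, not from the package source:
# Transformers@0.1.15:  Stack{T<:Tuple, FS}  ->  Stack{NTuple{4, ...}, Symbol("x':x => 4")}
# Transformers@0.1.25:  Stack{FS, T<:Tuple}  ->  Stack{Symbol("x':x => 4"), NTuple{4, ...}}
#
# BSON replays the old parameter order (Tuple first, Symbol second) into the
# new definition, so the Symbol lands in the slot constrained by T<:Tuple,
# producing: TypeError: in Stack, in T, expected T<:Tuple, got a value of type Symbol
```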
chengchingwen commented 1 year ago

Might need some manual work, but in general:

  1. In v0.1.15, load the model and extract the parameters (either with Functors or get_state_dict), then save the parameters (which should be plain arrays only, without any model types) to disk.
  2. In v0.1.25, load the parameters from step 1 and build the model with the new types (either by calling the constructors with those arrays directly, or by constructing the model with the default initializer and manually assigning each saved parameter to the corresponding array in the model).
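A hedged sketch of the two steps. The file names are illustrative, and `Flux.params`/`Flux.loadparams!` are one possible way to extract and restore the arrays (it relies on the old and new models iterating their parameters in the same order):

```julia
# --- Step 1: run in an environment with Transformers@0.1.15 / Flux@0.12.10 ---
using BSON, Flux
BSON.@load "old_model.bson" model                 # loads fine under the old types
weights = [copy(p) for p in Flux.params(model)]   # plain arrays only, no model types
BSON.@save "old_model_weights.bson" weights

# --- Step 2: run in an environment with Transformers@0.1.25 / Flux@0.13.10 ---
using BSON, Flux
BSON.@load "old_model_weights.bson" weights
# `new_model` stands for the same architecture rebuilt with the current
# constructors (same layer sizes and order, so parameter iteration matches):
Flux.loadparams!(new_model, weights)
```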