rockerBOO opened this issue 1 year ago
Ok did some further investigation.
For these, they are all based on the default inference configs set inside CompVis/stable-diffusion:
- the inference config files,
- the lines that read the config in and load/call the modules listed in the config,
- the classes inside CompVis/stable-diffusion that those modules map to.
Inside this repo we have src/pipelines/stable_diffusion.rs, which has loaders for the VAE, UNet, autoencoder, scheduler, and CLIP (CLIP for 1.x and OpenCLIP for 2.x). Is this something where a LatentDiffusion model needs to be made to support this?
Ok, so it seems these checkpoint files have all the components inside, and the tensors are named in a way that lets them be rebuilt from the Python modules.
I took some time reading through the diffusers library some more (I wasn't using it before, mostly just the CompVis forks). HF diffusers has a from_pretrained method which allows you to load a model from the Hugging Face Hub or a local copy. Putting a model on HF has you create a model_index.json to help define the mappings to the components in HF diffusers and transformers.
To put these together, they have diffusion pipelines. This allows various diffusion pipelines to be made, including the Stable Diffusion pipeline, which brings all of these parts together. This library has a stable_diffusion pipeline as well.
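For reference, a minimal sketch of how the diffusers side pulls all of this together through from_pretrained (the model id is only an example; any repo or local directory with a model_index.json works the same way):

# Loading a full Stable Diffusion pipeline with HF diffusers; model_index.json
# tells diffusers which class each component folder maps to.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The components discussed in this issue are then available as attributes.
print(type(pipe.unet))          # UNet2DConditionModel
print(type(pipe.vae))           # AutoencoderKL
print(type(pipe.text_encoder))  # CLIPTextModel (from transformers)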
So, to use premade checkpoints that bundle all these components in the pipeline, there needs to be a translation process that maps the named parameters in the checkpoint to the right components (autoencoder, unet, etc.). I started trying to piece together the names against a possible internal representation, but I don't have a clear enough picture to process it yet.
For example:
model.diffusion_model.middle_block.1.transformer_blocks.0.attn2.to_out.0.weight Tensor[[1280, 1280], Half]
model.diffusion_model.middle_block.1.transformer_blocks.0.attn2.to_out.0.bias Tensor[[1280], Half]
Might be inside this data structure:
UNetMidBlock2DCrossAttn {
    attn_resnets: vec![(
        SpatialTransformer {
            transformer_blocks: vec![BasicTransformerBlock {
                attn1,
                ff,
                attn2: CrossAttention {
                    to_q,
                    to_k,
                    to_v,
                    to_out: nn::Linear {
                        ws: Tensor,
                        bs: Option<Tensor>,
                    },
                },
                norm1,
                norm2,
                norm3,
            }, ...],
        },
        _,
    )],
}
But a few of the conversions are less clear.
Well, it's a little clearer in my head now, but let me know if you have any opinions about this thought process. It seems doable, but translating between the libraries and between Python and Rust is taking me some time to understand.
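As an aside, a quick way to dump the raw names out of a checkpoint for this kind of comparison (a sketch; the .ckpt filename is just a placeholder, and it assumes the usual state_dict layout):

import torch

# Load the original CompVis-style checkpoint and list the UNet tensors that live
# under the middle block, to line them up against the structure sketched above.
ckpt = torch.load("analog-diffusion-1.0.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints nest everything under "state_dict"
for name, tensor in state_dict.items():
    if name.startswith("model.diffusion_model.middle_block"):
        print(name, tuple(tensor.shape), tensor.dtype)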
I read a helpful post about Latent Diffusion that documents the CompVis version and highlights what certain things in the checkpoint files are. Now it all seems to match up. The only thing that's a little off is that there are no resnets in the CompVis naming, but they are there inside the local unet.
model.diffusion_model = unet
first_stage_model = autoencoder
cond_stage_model.transformer = CLIPTextEmbedder/ClipTextTransformer
cond_stage_model.transformer
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.embeddings.position_ids Tensor[[1, 77], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.embeddings.token_embedding.weight Tensor[[49408, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.embeddings.position_embedding.weight Tensor[[77, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.k_proj.weight Tensor[[768, 768], Half]
./data/analog-diffusion-1.0.ot: cond_stage_model.transformer.text_model.encoder.layers.0.self_attn.k_proj.bias Tensor[[768], Half]
This matches up with the CLIP model pytorch_model.ot:
➜ cargo run --release --example tensor-tools ls ./data/pytorch_model.ot
Finished release [optimized] target(s) in 0.07s
Running `target/release/examples/tensor-tools ls ./data/pytorch_model.ot`
./data/pytorch_model.ot: text_model.embeddings.position_ids Tensor[[1, 77], Int64]
./data/pytorch_model.ot: text_model.embeddings.token_embedding.weight Tensor[[49408, 768], Float]
./data/pytorch_model.ot: text_model.embeddings.position_embedding.weight Tensor[[77, 768], Float]
./data/pytorch_model.ot: text_model.encoder.layers.0.self_attn.k_proj.bias Tensor[[768], Float]
first_stage_model
./data/analog-diffusion-1.0.ot: first_stage_model.encoder.conv_in.weight Tensor[[128, 3, 3, 3], Half]
./data/analog-diffusion-1.0.ot: first_stage_model.encoder.conv_in.bias Tensor[[128], Half]
./data/analog-diffusion-1.0.ot: first_stage_model.encoder.down.0.block.0.norm1.weight Tensor[[128], Half]
./data/analog-diffusion-1.0.ot: first_stage_model.encoder.down.0.block.0.norm1.bias Tensor[[128], Half]
This matches the vae.ot:
./data/vae.ot: encoder.conv_in.weight Tensor[[128, 3, 3, 3], Float]
./data/vae.ot: encoder.conv_in.bias Tensor[[128], Float]
./data/vae.ot: encoder.down_blocks.0.resnets.0.norm1.weight Tensor[[128], Float]
./data/vae.ot: encoder.down_blocks.0.resnets.0.norm1.bias Tensor[[128], Float]
model.diffusion_model
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.0.0.weight Tensor[[320, 4, 3, 3], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.0.0.bias Tensor[[320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.0.weight Tensor[[1280, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.0.bias Tensor[[1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.2.weight Tensor[[1280, 1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.time_embed.2.bias Tensor[[1280], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.norm.weight Tensor[[320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.norm.bias Tensor[[320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.proj_in.weight Tensor[[320, 320, 1, 1], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.proj_in.bias Tensor[[320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn1.to_q.weight Tensor[[320, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn1.to_k.weight Tensor[[320, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn1.to_v.weight Tensor[[320, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn1.to_out.0.weight Tensor[[320, 320], Half]
./data/analog-diffusion-1.0.ot: model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn1.to_out.0.bias Tensor[[320], Half]
This matches unet.ot:
./data/unet.ot: conv_in.weight Tensor[[320, 4, 3, 3], Float]
./data/unet.ot: conv_in.bias Tensor[[320], Float]
./data/unet.ot: time_embedding.linear_1.weight Tensor[[1280, 320], Float]
./data/unet.ot: time_embedding.linear_1.bias Tensor[[1280], Float]
./data/unet.ot: time_embedding.linear_2.weight Tensor[[1280, 1280], Float]
./data/unet.ot: time_embedding.linear_2.bias Tensor[[1280], Float]
./data/unet.ot: down_blocks.0.attentions.0.norm.weight Tensor[[320], Float]
./data/unet.ot: down_blocks.0.attentions.0.norm.bias Tensor[[320], Float]
./data/unet.ot: down_blocks.0.attentions.0.proj_in.weight Tensor[[320, 320, 1, 1], Float]
./data/unet.ot: down_blocks.0.attentions.0.proj_in.bias Tensor[[320], Float]
./data/unet.ot: down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight Tensor[[320, 320], Float]
./data/unet.ot: down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight Tensor[[320, 320], Float]
./data/unet.ot: down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v.weight Tensor[[320, 320], Float]
./data/unet.ot: down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.weight Tensor[[320, 320], Float]
./data/unet.ot: down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.bias Tensor[[320], Float]
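From the dumps above, a few of the CompVis-to-diffusers name correspondences can already be written down. A start of that rename map might look like this (illustrative only, covering just the tensors listed above; the full mapping has many more entries and some structural regrouping):

# Partial rename map, CompVis checkpoint name -> diffusers-style name,
# based only on the tensors listed above; a hypothetical starting point, not exhaustive.
COMPVIS_TO_DIFFUSERS = {
    # unet: model.diffusion_model.*
    "model.diffusion_model.input_blocks.0.0.weight": "conv_in.weight",
    "model.diffusion_model.input_blocks.0.0.bias": "conv_in.bias",
    "model.diffusion_model.time_embed.0.weight": "time_embedding.linear_1.weight",
    "model.diffusion_model.time_embed.2.weight": "time_embedding.linear_2.weight",
    "model.diffusion_model.input_blocks.1.1.norm.weight": "down_blocks.0.attentions.0.norm.weight",
    # vae: first_stage_model.*
    "first_stage_model.encoder.conv_in.weight": "encoder.conv_in.weight",
    "first_stage_model.encoder.down.0.block.0.norm1.weight": "encoder.down_blocks.0.resnets.0.norm1.weight",
    # text encoder: cond_stage_model.transformer.* -> mostly just strip the prefix
}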
Now, looking at that, how would we actually load this in? We are using load, which expects a file for each of these, but how would we load all of them at the same time and convert all the parameters over? They use nn::VarStore, but I'm not exactly sure how it would work to load these parameters (load, load_from_stream, or load_partial?).
Not sure I understand the full context here, but my guess is that the easiest would be to perform the renaming on the Python side and write separate .npz files for each bit, and then use tensor-tools to convert each of these to a .ot file.
There is some preliminary support for loading checkpoints saved from Python at the tch level, but sadly it's not fully working yet; see https://github.com/LaurentMazare/tch-rs/issues/595 for more details.
That sounds good! I will write a Python script to convert it over to the various .npz files and report back. I've subscribed to that issue for any developments on that front.
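A rough sketch of what that script could look like (assuming a standard CompVis-style .ckpt with a state_dict entry; the filenames are placeholders, and the per-parameter renaming still has to be layered on top):

import numpy as np
import torch

# Split a CompVis-style checkpoint into one .npz per component, keyed by name prefix.
# The diffusers-style renaming is not done here; this only separates the components.
PREFIXES = {
    "model.diffusion_model.": "unet.npz",
    "first_stage_model.": "vae.npz",
    "cond_stage_model.transformer.": "clip.npz",
}

ckpt = torch.load("analog-diffusion-1.0.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

for prefix, out_path in PREFIXES.items():
    tensors = {
        name[len(prefix):]: tensor.float().numpy()  # cast to fp32 to match the reference .ot files
        for name, tensor in state_dict.items()
        if name.startswith(prefix)
    }
    np.savez(out_path, **tensors)
    print(f"wrote {len(tensors)} tensors to {out_path}")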
I was working through making a conversion script when I realized that someone might have already made one. So I went to Google and found that diffusers has conversion scripts for the various formats. The scripts there should be able to convert any of the other versions into a diffusers version.
Since we want individual files, I will be trying to get the script to run in one step instead of two (instead of making a .ckpt and then extracting out the parts).
The conversion scripts take your checkpoint and extract it into the Hugging Face diffusers format, with many folders and configuration JSON files:
feature_extractor model_index.json safety_checker scheduler text_encoder tokenizer unet vae
This then has the extracted unet (unet/diffusion_pytorch_model.bin) and vae (vae/diffusion_pytorch_model.bin), which can then be converted over to .npz and from there into .ot.
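Converting one of those extracted components over is then straightforward (a sketch; the paths are placeholders, and tensor-tools handles the .npz-to-.ot step afterwards):

import numpy as np
import torch

# The diffusers-format unet/vae weights are plain torch state dicts that already
# use the diffusers parameter names, so no renaming is needed at this stage.
state_dict = torch.load("unet/diffusion_pytorch_model.bin", map_location="cpu")
np.savez("unet.npz", **{name: tensor.numpy() for name, tensor in state_dict.items()})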
It creates files we don't need in many cases, like the text_encoder (general Stable Diffusion uses CLIP or OpenCLIP) and the safety_checker (which detects NSFW and other potentially harmful content), which can add another couple of gigabytes of files.
It also makes this a two-step process: create the Hugging Face version, then convert that over to .npz, and then convert that to .ot.
For me, I need the fp16 UNet to fit onto my GPU, so this would require another step of converting the fp32 weights over, which I haven't gotten working yet.
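The fp32-to-fp16 cast could be folded into the same conversion step (a sketch with the same placeholder paths as above; whether the rest of the pipeline is happy with Half tensors is a separate question):

import numpy as np
import torch

# Same .bin -> .npz conversion as above, but casting every tensor to fp16 first.
state_dict = torch.load("unet/diffusion_pytorch_model.bin", map_location="cpu")
np.savez("unet-fp16.npz", **{name: tensor.half().numpy() for name, tensor in state_dict.items()})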
Many models seem to use this conversion to put them up on Hugging Face, so we can just download those files from their extraction and use the unet and vae parts. Then it's a matter of converting to .npz and .ot from there.
So for me the biggest issue at the moment is converting from fp32 to fp16; the rest seems doable. After I get this working, I will write a little walkthrough for other users.
Would it be possible to use models based on the CompVis style used by stabilityai and supported in HF diffusers? My personal goals are:
I tried the following to convert the file over, and got the names of the tensors using tensor-tools. Maybe these can be extracted and compiled back together?
Full list: analog-diffusion-1.0.ot.log
Thanks!