Open davidw0311 opened 2 months ago
This looks alright; probably change to NHWC. See the code: https://github.com/drawthingsai/draw-things-community/blob/main/Libraries/SwiftDiffusion/Sources/TextEncoder.swift#L831
Also, if there is NO_BACKEND, check whether the MPS config (build --config=mps, which defines enable_mps) is set in your .bazelrc (it should already be set in .bazelrc.darwin), just in case. As long as you run ./Scripts/install.sh, these are all set for you.
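For reference, these are the MPS-related lines as they appear in the .bazelrc.darwin quoted in full later in this thread:
common:mps --define=enable_mps=true
build --config=mps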
Hello, here's my attempt to run a text-to-image generation example, following the code from swift-diffusion:
import Swift
import Foundation
import SwiftUI
import CoreGraphics
import PNG
import NNC
import Diffusion
import Accelerate
public typealias UseFloatingPoint = Float16
let textEncoderPath = "/Users/davidw/models/sd-v1.4.ckpt"
let vocabPath = "/Users/davidw/models/vocab.json"
let mergesPath = "/Users/davidw/models/merges.txt"
DynamicGraph.setSeed(123456)
DynamicGraph.logLevel = .verbose
DynamicGraph.memoryEfficient = true
let graph = DynamicGraph()
graph.workspaceSize = 1_024 * 1_024 * 1_024
let prompt = "a cute golden retriever in the style of van gogh"
let maxLength = 77
let tokenizer = CLIPTokenizer(vocabulary: vocabPath, merges: mergesPath)
let tokens = tokenizer.tokenize(text: prompt, truncation: true, maxLength: maxLength).1
let uncondTokens = tokenizer.tokenize(text: "", truncation: true, maxLength: maxLength).1
let positionTensor = graph.variable(.CPU, .C(2 * maxLength), of: Int32.self)
let tokensTensor = graph.variable(.CPU, .C(2 * maxLength), of: Int32.self)
for i in 0..<maxLength {
  // The first maxLength entries hold the unconditional (empty) prompt,
  // the second maxLength entries hold the conditional prompt.
  tokensTensor[i] = uncondTokens[i]
  tokensTensor[i + maxLength] = tokens[i]
  positionTensor[i] = Int32(i)
  positionTensor[i + maxLength] = Int32(i)
}
let textModel = TextEncoder<UseFloatingPoint>(
  filePaths: [textEncoderPath],
  version: .v1,
  usesFlashAttention: true,
  injectEmbeddings: false,
  externalOnDemand: false,
  maxLength: maxLength,
  clipSkip: 0,
  lora: []
)
let (encoding, _) = textModel.encode(
  tokens: [tokensTensor],
  positions: [positionTensor],
  mask: [],
  injectedEmbeddings: [],
  image: [],
  lengthsOfUncond: [maxLength],
  lengthsOfCond: [maxLength],
  textModels: []
)
print(encoding)
When I execute this, I get the error:
CCV_NNC_DATA_TRANSFER_FORWARD: [1] -> [1]
|-> 1. 0x151e059c0 (0x151e05a40:0) [154] 49406 49407 49407 ..
|<- 1. 0x6000024c93b0 (0x151e07ba0:0) [154] 49406 49407 49407 ..
CCV_NNC_DATA_TRANSFER_FORWARD: [1] -> [1]
|-> 1. 0x151e06300 (0x151e06380:0) [154] 0 1 2 ..
|<- 1. 0x6000024c8d20 (0x151e08230:0) [154] 0 1 2 ..
Swift/ContiguousArrayBuffer.swift:600: Fatal error: Index out of range
I am wondering where I am going wrong with my implementation? Thank you!
I think it is this:
let (encoding, _) = textModel.encode(
  tokens: [tokensTensor],
  positions: [positionTensor],
  mask: [],
  injectedEmbeddings: [],
  image: [],
  lengthsOfUncond: [maxLength],
  lengthsOfCond: [maxLength],
  textModels: [nil]
)
The textModels array is expected to contain 1 value (for SDXL: 2 values).
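As a hedged aside (extrapolating from the reply above rather than quoting it): for an SDXL model, which has two text encoders, the same call would presumably pass two placeholders:
textModels: [nil, nil]  // assumption: one nil per text encoder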
Thank you! That does solve the error, but now I am getting
CCV_NNC_GEMM_FORWARD [8]: [3] -> [1] (2)
Wait: (2, 2)
|-> 1. 0x1301c80e0 (0x12ce45bc0:0) [154x768] 1.156250 -0.122498 0.021973 ..
|-> 2. 0x1301d4160 (0x12ce14c50:0) [768x768] -0.079773 -0.075256 -0.007603 ..
|-> 3. 0x1301d41d0 (0x12ce14dc0:0) [768] 0.000000 0.000000 0.000000 ..
|<- 1. 0x1301c8310 (0x12ce46650:0) [154x768] -0.433350 -0.454834 -0.808105 ..
Emit: (2, 4)
CCV_NNC_SCALED_DOT_PRODUCT_ATTENTION_FORWARD [9]: [6] -> [3] (0)
Wait: (0, 3), (0, 4)
|-> 1. 0x1301de960 (0x12ce46370:0) [2x77x12x64] 0.651855 0.182739 0.683105 ..
|-> 2. 0x1301de9d0 (0x12ce464e0:0) [2x77x12x64] -1.309570 -0.754883 -1.266602 ..
|-> 3. 0x1301dea40 (0x12ce46650:0) [2x77x12x64] -0.433350 -0.454834 -0.808105 ..
|-> 4. 0x1301d3d00 (0x11ce05f80:0) [2x1x77x77] 0.000000 -65504.000000 -65504.000000 ..
|-> 5. 0x1301d4240 (0x12ce14f30:0) [768x768] 0.061432 -0.039032 -0.016922 ..
|-> 6. 0x1301d42b0 (0x12ce150a0:0) [768] 0.000000 0.000000 0.000000 ..
Assertion failed: (backend != CCV_NNC_NO_BACKEND), function ccv_nnc_cmd_exec, file ccv_nnc_cmd.c, line 682.
this is my .bazelrc.darwin:
common:mps --define=enable_mps=true
common --disk_cache=.cache
build --cxxopt='-std=c++17'
build --config=mps
build --features=swift.use_global_module_cache
build --strategy=SwiftCompile=worker
build --features=swift.enable_batch_mode
common:release --define=enable_mps=true
common:release --swiftcopt=-whole-module-optimization
common:release --compilation_mode=opt
common:release --apple_generate_dsym
try-import %workspace%/.bazelrc.local
and my .bazelrc:
try-import %workspace%/.bazelrc.darwin
Try this:
let positionTensor = graph.variable(.CPU, format: .NHWC, shape: [2 * maxLength], of: Int32.self)
let tokensTensor = graph.variable(.CPU, format: .NHWC, shape: [2 * maxLength], of: Int32.self)
SDPA (scaled dot-product attention) only accepts NHWC tensors, and the tensor format is inherited from the input tensors, so fixing the two input tensors is enough.
Thanks! That solves the error :)
Adapting code from swift-diffusion, I am trying to generate an image using:
let (c, _) = textModel.encode(
  tokens: [tokensTensor],
  positions: [positionTensor],
  mask: [],
  injectedEmbeddings: [],
  image: [],
  lengthsOfUncond: [maxLength],
  lengthsOfCond: [maxLength],
  textModels: [nil]
)
let startWidth = 64
let startHeight = 64
let generationSteps = 25
let unconditionalGuidanceScale: Float = 7.5
let scaleFactor: Float = 0.18215
let model = DiffusionModel(linearStart: 0.00085, linearEnd: 0.012, timesteps: 1_000, steps: generationSteps)
let alphasCumprod = model.alphasCumprod
let sigmasForTimesteps = DiffusionModel.sigmas(from: alphasCumprod)
// alpha_t = sqrt(alphaCumprod_t), sigma_t = sqrt(1 - alphaCumprod_t);
// lambda_t = log(alpha_t) - log(sigma_t) is the half log-SNR used by UniPC.
let alphas = alphasCumprod.map { $0.squareRoot() }
let sigmas = alphasCumprod.map { (1 - $0).squareRoot() }
let lambdas = zip(alphas, sigmas).map { log($0) - log($1) }
let (unet, _) = UNet(
  batchSize: 2, embeddingLength: (77, 77), startWidth: startWidth, startHeight: startHeight,
  usesFlashAttention: .scale1, injectControls: false, injectT2IAdapters: false,
  injectIPAdapterLengths: [0]
)
let (decoder, _, _) = Decoder(
  channels: [128, 256, 512, 512], numRepeat: 2, batchSize: 1, startWidth: startWidth,
  startHeight: startHeight, usesFlashAttention: true, paddingFinalConvLayer: false)
var timestepList = [Int]()
var outputList = [DynamicGraph.Tensor<UseFloatingPoint>]()
let startTime = Date()
var lastSample: DynamicGraph.Tensor<UseFloatingPoint>? = nil
let x_T = graph.variable(.GPU(0), .NHWC(1, startHeight, startWidth, 4), of: UseFloatingPoint.self)
x_T.randn(std: 1, mean: 0)
var x = x_T
var xIn = graph.variable(.GPU(0), .NHWC(2, startHeight, startWidth, 4), of: UseFloatingPoint.self)
for i in 0..<model.steps {
  let timestep = model.timesteps - model.timesteps / model.steps * i - 1
  let ts = timeEmbedding(timestep: Float(timestep), batchSize: 2, embeddingSize: 320, maxPeriod: 10_000)
    .toGPU(0)
  let t = graph.variable(Tensor<UseFloatingPoint>(from: ts))
  // Duplicate the latent so one UNet pass covers both the unconditional
  // and the conditional branch.
  xIn[0..<1, 0..<startHeight, 0..<startWidth, 0..<4] = x
  xIn[1..<2, 0..<startHeight, 0..<startWidth, 0..<4] = x
  var et = unet(inputs: xIn, t, c[0])[0].as(of: UseFloatingPoint.self)
  var etUncond = graph.variable(
    .GPU(0), .NHWC(1, startHeight, startWidth, 4), of: UseFloatingPoint.self)
  var etCond = graph.variable(
    .GPU(0), .NHWC(1, startHeight, startWidth, 4), of: UseFloatingPoint.self)
  etUncond[0..<1, 0..<startHeight, 0..<startWidth, 0..<4] =
    et[0..<1, 0..<startHeight, 0..<startWidth, 0..<4]
  etCond[0..<1, 0..<startHeight, 0..<startWidth, 0..<4] =
    et[1..<2, 0..<startHeight, 0..<startWidth, 0..<4]
  // Classifier-free guidance.
  et = etUncond + unconditionalGuidanceScale * (etCond - etUncond)
  // UniPC sampler: data prediction m_t = (x - sigma_t * et) / alpha_t.
  let mt = Functional.add(
    left: x, right: et, leftScalar: 1.0 / alphas[timestep],
    rightScalar: -sigmas[timestep] / alphas[timestep])
  let useCorrector = lastSample != nil
  if useCorrector, let lastSample = lastSample {
    x = uniCBhUpdate(
      mt: mt, timestep: timestep, lastSample: lastSample, timestepList: timestepList,
      outputList: outputList, lambdas: lambdas, alphas: alphas, sigmas: sigmas)
  }
  // Keep a sliding window of the last two timesteps / outputs for the
  // multistep update.
  if timestepList.count < 2 {
    timestepList.append(timestep)
  } else {
    timestepList[0] = timestepList[1]
    timestepList[1] = timestep
  }
  if outputList.count < 2 {
    outputList.append(mt)
  } else {
    outputList[0] = outputList[1]
    outputList[1] = mt
  }
  let prevTimestep = max(0, model.timesteps - model.timesteps / model.steps * (i + 1) - 1)
  lastSample = x
  x = uniPBhUpdate(
    mt: mt, prevTimestep: prevTimestep, sample: x, timestepList: timestepList,
    outputList: outputList, lambdas: lambdas, alphas: alphas, sigmas: sigmas)
}
// Scale the latent back before decoding.
let z = 1.0 / scaleFactor * x
let img = DynamicGraph.Tensor<Float>(from: decoder(inputs: z)[0].as(of: UseFloatingPoint.self))
  .toCPU()
I'm not sure if this is the correct way, but I am getting
CCV_NNC_RANDOM_NORMAL_FORWARD: [0] -> [1]
|<- 1. 0x600000039490 (0x135e23a70:0) [1x64x64x4] -1.528320 -0.254395 -0.121826 ..
CCV_NNC_FORMAT_TRANSFORM_FORWARD: [1] -> [1]
|-> 1. 0x600000039490 (0x135e23a70:0) [1x64x64x4] -1.528320 -0.254395 -0.121826 ..
|<- 1. 0x60000002f480 (0x135e68770:0) [1x64x64x4] -1.528320 -0.254395 -0.121826 ..
CCV_NNC_FORMAT_TRANSFORM_FORWARD: [1] -> [1]
|-> 1. 0x600000039490 (0x135e23a70:0) [1x64x64x4] -1.528320 -0.254395 -0.121826 ..
|<- 1. 0x600001f8c000 (0x135e68770:0) [1x64x64x4] -1.528320 -0.254395 -0.121826 ..
Assertion failed: (input_size == model->input_size || model->input_size == 0), function ccv_cnnp_model_compile, file ccv_cnnp_model.c, line 573.
at the line
var et = unet(inputs: xIn, t, c[0])[0].as(of: UseFloatingPoint.self)
How can I make this work?
Thanks!
I wonder if this is the more correct implementation. I am trying to initialize a UNetFromNNC object and compile it:
var unet = UNetFromNNC<UseFloatingPoint>()
let x_T = graph.variable(.GPU(0), .NHWC(2, startHeight, startWidth, 4), of: UseFloatingPoint.self)
x_T.randn(std: 1, mean: 0)
let timestep = timeEmbedding(timestep: 0, batchSize: 2, embeddingSize: 320, maxPeriod: 10_000).toGPU(0)
let timestepTensor = graph.variable(Tensor<UseFloatingPoint>(from: timestep))
unet.compileModel(
  filePath: unetPath, externalOnDemand: true, version: .v1, upcastAttention: true,
  usesFlashAttention: true, injectControls: false, injectT2IAdapters: false,
  injectIPAdapterLengths: [0], lora: [],
  is8BitModel: false, canRunLoRASeparately: false,
  inputs: x_T, timestepTensor, c,
  tokenLengthUncond: 77, tokenLengthCond: 77,
  extraProjection: nil,
  injectedControls: [],
  injectedT2IAdapters: [],
  injectedIPAdapters: []
)
but run into the error:
CCV_NNC_LAYER_NORM_FORWARD [161]: [3] -> [3] (0)
|-> 1. 0x128025780 (0x141826580:0) [154x768] -5.328125 -1.828125 -4.781250 ..
|-> 2. 0x1280312c0 (0x140718c60:0) [1x768] 0.259766 0.989258 0.238281 ..
|-> 3. 0x128031330 (0x140718dd0:0) [1x768] 0.000000 0.000000 0.000000 ..
|<- 1. 0x12802bd70 (0x1407072f0:0) [154x768] -0.263672 -0.271240 -0.214355 ..
|<- 2. 0x1280257f0 (0x1418266f0:0) [154x1] -0.531738 ..
|<- 3. 0x128025860 (0x141826860:0) [154x1] 0.211670 ..
Graph Stream 0 End
|<- 1. 0x600000a64540 (0x1407072f0:0) [154x768] -0.263672 -0.271240 -0.214355 ..
CCV_NNC_RANDOM_NORMAL_FORWARD: [0] -> [1]
|<- 1. 0x6000009e6530 (0x140605f90:0) [2x64x64x4] -1.528320 -0.254395 -0.121826 ..
Assertion failed: (input_size == model->input_size || model->input_size == 0), function ccv_cnnp_model_compile, file ccv_cnnp_model.c, line 573.
injectIPAdapterLengths: [0], lora: [],
Should be
injectIPAdapterLengths: [], lora: [],
Otherwise it will be treated as having an IPAdapter tensor injected as input: https://github.com/drawthingsai/draw-things-community/blob/c0b21b67ffb16212bdf44e4159f26cc251cbdbd7/Libraries/SwiftDiffusion/Sources/Models/UNet.swift#L208
Also, you can see how we use it in this file: https://github.com/drawthingsai/draw-things-community/blob/c0b21b67ffb16212bdf44e4159f26cc251cbdbd7/Libraries/ImageGenerator/Sources/ImageGenerator.swift
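To make that concrete, here is a sketch of the corrected compileModel call from the snippet above, with only injectIPAdapterLengths changed and everything else kept as in the original:
unet.compileModel(
  filePath: unetPath, externalOnDemand: true, version: .v1, upcastAttention: true,
  usesFlashAttention: true, injectControls: false, injectT2IAdapters: false,
  injectIPAdapterLengths: [], lora: [],  // empty: no IPAdapter tensors injected as inputs
  is8BitModel: false, canRunLoRASeparately: false,
  inputs: x_T, timestepTensor, c,
  tokenLengthUncond: 77, tokenLengthCond: 77,
  extraProjection: nil,
  injectedControls: [],
  injectedT2IAdapters: [],
  injectedIPAdapters: []
)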
Thanks so much! Would you be able to provide a script for, e.g., loading the stable-diffusion 1.5 checkpoint and performing text-to-image generation?
Hi! It probably will be low on the list of things to do. If you can use the ImageGenerator class, it is easier. For example, first set up the ModelZoo path correctly (i.e. the models the app requires, downloaded from https://static.libnnc.org/modelname, or in the app's Models container directory), following this line: https://github.com/drawthingsai/draw-things-community/blob/main/Apps/ModelConverter/Converter.swift#L34
Then you can simply call ImageGenerator to do text2img:
let imageGenerator = ImageGenerator(
  queue: queue, configurations: configurations, workspace: workspace, tokenizerV1: tokenizerV1,
  tokenizerV2: tokenizerV2, tokenizerXL: tokenizerXL, tokenizerKandinsky: tokenizerKandinsky,
  poseDrawer: DefaultPoseDrawer())
let (tensors, scale) = imageGenerator.generate(
  nil, scaleFactor: 1, mask: nil,
  depth: nil,
  hints: [:], custom: nil, shuffles: [], text: prompt,
  negativeText: negativePrompt,
  configuration: configuration
) { signpost, signposts, tensor in
  // Progress callback; return true to keep generating.
  return true
}
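As a side note (an editorial sketch, not part of the reply above): if you instead stay on the manual path from the earlier snippet, the decoded img tensor can be written to disk with the already-imported PNG package, mirroring swift-diffusion's txt2img example. The 8x decoder upsampling and the output path are assumptions:
let width = startWidth * 8, height = startHeight * 8  // SD v1 VAE upsamples latents 8x
var rgba = [PNG.RGBA<UInt8>](repeating: .init(0), count: width * height)
for y in 0..<height {
  for x in 0..<width {
    // Map decoder output from roughly [-1, 1] to [0, 255], clamping.
    let (r, g, b) = (img[0, y, x, 0], img[0, y, x, 1], img[0, y, x, 2])
    rgba[y * width + x].r = UInt8(min(max(Int((r + 1) / 2 * 255), 0), 255))
    rgba[y * width + x].g = UInt8(min(max(Int((g + 1) / 2 * 255), 0), 255))
    rgba[y * width + x].b = UInt8(min(max(Int((b + 1) / 2 * 255), 0), 255))
  }
}
let png = PNG.Data.Rectangular(
  packing: rgba, size: (width, height),
  layout: PNG.Layout(format: .rgb8(palette: [], fill: nil, key: nil)))
try! png.compress(path: "/tmp/txt2img.png", level: 4)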
Hi, I am wondering if a minimal working example could be provided for running text-to-image? I have been trying to execute the code on my Mac mini M2, following a similar approach as in swift-diffusion, but I get the error
Assertion failed: (backend != CCV_NNC_NO_BACKEND), function ccv_nnc_cmd_exec, file ccv_nnc_cmd.c, line 682.
after compiling the textModel and running it.
Thanks!