TensorStack-AI / OnnxStack

C# Stable Diffusion using ONNX Runtime
Apache License 2.0

Add integration tests #25

Closed james-s-tayler closed 1 year ago

james-s-tayler commented 1 year ago

Hi,

I've been working on a PR that adds some integration tests to OnnxStackCore.sln, so that there's a repeatable way to test and run the functionality without being coupled to a particular UI implementation, to guard against regressions as development continues, and to give me a way to start contributing new things.

However, there are some problems I have encountered.

The first was that, for whatever reason, the tests run fine inside the Docker container but not on my local Ubuntu installation. I tracked it down to a wacky bug in the OnnxRuntime Extensions library: when registering the custom operations it tries to resolve the name of the native library it needs to load, but that file has been renamed at some point, so it claims it can't find ortextensions when the file is actually called libortextensions.so and that's what it needs. Yet for some magical reason it just works when I run it inside the Docker container. Anyway, don't worry about that; since I can run it inside Docker I'm not too fussed about solving it for now, I just thought I'd mention it.

Second, and more importantly: I had both tests in this PR running perfectly and passing before merging in the current master branch, which contained 15 new commits. I think those changes might have actually broken something?

The error that I now get when I run the tests is as follows:

yolo@pop-os:~/source/OnnxStack$ docker-compose up --build
Building app
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  79.35MB
Step 1/8 : FROM mcr.microsoft.com/dotnet/sdk:7.0 AS build
 ---> 889872ffeee7
Step 2/8 : WORKDIR /app
 ---> Using cache
 ---> d640a6580dcb
Step 3/8 : RUN apt-get update && apt-get install -y curl
 ---> Using cache
 ---> 11e68392f874
Step 4/8 : RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && apt-get install -y git-lfs
 ---> Using cache
 ---> 0d9beb7a3096
Step 5/8 : RUN git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 -b onnx
 ---> Using cache
 ---> eea1b547a38c
Step 6/8 : COPY . .
 ---> Using cache
 ---> da9d2216040c
Step 7/8 : RUN dotnet build OnnxStackCore.sln
 ---> Using cache
 ---> 504b6c483a04
Step 8/8 : ENTRYPOINT ["dotnet", "test", "OnnxStackCore.sln"]
 ---> Using cache
 ---> 8ceca5b73ff3
Successfully built 8ceca5b73ff3
Successfully tagged onnxstack_app:latest
Starting onnxstack_app_1 ... done
Attaching to onnxstack_app_1
app_1  |   Determining projects to restore...
app_1  |   All projects are up-to-date for restore.
app_1  |   OnnxStack.Core -> /app/OnnxStack.Core/bin/Debug/net7.0/OnnxStack.Core.dll
app_1  |   OnnxStack.StableDiffusion -> /app/OnnxStack.StableDiffusion/bin/Debug/net7.0/OnnxStack.StableDiffusion.dll
app_1  |   OnnxStack.IntegrationTests -> /app/OnnxStack.IntegrationTests/bin/Debug/net7.0/OnnxStack.IntegrationTests.dll
app_1  | Test run for /app/OnnxStack.IntegrationTests/bin/Debug/net7.0/OnnxStack.IntegrationTests.dll (.NETCoreApp,Version=v7.0)
app_1  | Microsoft (R) Test Execution Command Line Tool Version 17.7.2 (x64)
app_1  | Copyright (c) Microsoft Corporation.  All rights reserved.
app_1  | 
app_1  | Starting test execution, please wait...
app_1  | A total of 1 test files matched the specified pattern.
app_1  | info: OnnxStack.IntegrationTests.StableDiffusionTests[0]
app_1  |       Attempting to load model StableDiffusion 1.5
app_1  | info: OnnxStack.IntegrationTests.StableDiffusionTests[0]
app_1  |       Attempting to load model StableDiffusion 1.5
app_1  | info: OnnxStack.StableDiffusion.Diffusers.StableDiffusion.StableDiffusionDiffuser[0]
app_1  |       [DiffuseAsync] - Begin...
app_1  | info: OnnxStack.StableDiffusion.Diffusers.StableDiffusion.StableDiffusionDiffuser[0]
app_1  |       [DiffuseAsync] - Model: StableDiffusion 1.5, Pipeline: StableDiffusion, Diffuser: TextToImage, Scheduler: EulerAncestral
app_1  | [xUnit.net 00:00:51.53]     OnnxStack.IntegrationTests.StableDiffusionTests.GivenTextToImage_WhenInference_ThenImageGenerated [FAIL]
app_1  |   Failed OnnxStack.IntegrationTests.StableDiffusionTests.GivenTextToImage_WhenInference_ThenImageGenerated [28 s]
app_1  |   Error Message:
app_1  |    System.AggregateException : One or more errors occurred. ([ErrorCode:RuntimeException] Non-zero status code returned while running ReorderOutput node. Name:'ReorderOutput_token_942' Status Message: /onnxruntime_src/onnxruntime/core/framework/execution_frame.cc:171 onnxruntime::common::Status onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue(int, int, const onnxruntime::TensorShape*, OrtValue*&, const onnxruntime::Node&) shape && tensor.Shape() == *shape was false. OrtValue shape verification failed. Current shape:{1,4,64,64} Requested shape:{2,4,64,64}
app_1  | )
app_1  | ---- Microsoft.ML.OnnxRuntime.OnnxRuntimeException : [ErrorCode:RuntimeException] Non-zero status code returned while running ReorderOutput node. Name:'ReorderOutput_token_942' Status Message: /onnxruntime_src/onnxruntime/core/framework/execution_frame.cc:171 onnxruntime::common::Status onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue(int, int, const onnxruntime::TensorShape*, OrtValue*&, const onnxruntime::Node&) shape && tensor.Shape() == *shape was false. OrtValue shape verification failed. Current shape:{1,4,64,64} Requested shape:{2,4,64,64}
app_1  | 
app_1  |   Stack Trace:
app_1  |      at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
app_1  |    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
app_1  |    at OnnxStack.StableDiffusion.Services.StableDiffusionService.<>c.<GenerateAsImageAsync>b__10_0(Task`1 t) in /app/OnnxStack.StableDiffusion/Services/StableDiffusionService.cs:line 107
app_1  |    at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
app_1  |    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
app_1  | --- End of stack trace from previous location ---
app_1  |    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
app_1  |    at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
app_1  | --- End of stack trace from previous location ---
app_1  |    at OnnxStack.StableDiffusion.Services.StableDiffusionService.GenerateAsImageAsync(IModelOptions model, PromptOptions prompt, SchedulerOptions options, Action`2 progressCallback, CancellationToken cancellationToken) in /app/OnnxStack.StableDiffusion/Services/StableDiffusionService.cs:line 106
app_1  |    at OnnxStack.IntegrationTests.StableDiffusionTests.GivenTextToImage_WhenInference_ThenImageGenerated() in /app/OnnxStack.IntegrationTests/StableDiffusionTests.cs:line 83
app_1  | --- End of stack trace from previous location ---
app_1  | ----- Inner Stack Trace -----
app_1  |    at Microsoft.ML.OnnxRuntime.InferenceSession.<>c__DisplayClass75_0.<RunAsync>b__0(IReadOnlyCollection`1 outputs, IntPtr status)
app_1  | --- End of stack trace from previous location ---
app_1  |    at Microsoft.ML.OnnxRuntime.InferenceSession.RunAsync(RunOptions options, IReadOnlyCollection`1 inputNames, IReadOnlyCollection`1 inputValues, IReadOnlyCollection`1 outputNames, IReadOnlyCollection`1 outputValues)
app_1  |    at OnnxStack.StableDiffusion.Diffusers.StableDiffusion.StableDiffusionDiffuser.SchedulerStep(IModelOptions modelOptions, PromptOptions promptOptions, SchedulerOptions schedulerOptions, DenseTensor`1 promptEmbeddings, Boolean performGuidance, Action`2 progressCallback, CancellationToken cancellationToken) in /app/OnnxStack.StableDiffusion/Diffusers/StableDiffusion/StableDiffusionDiffuser.cs:line 91
app_1  |    at OnnxStack.StableDiffusion.Diffusers.DiffuserBase.DiffuseAsync(IModelOptions modelOptions, PromptOptions promptOptions, SchedulerOptions schedulerOptions, Action`2 progressCallback, CancellationToken cancellationToken) in /app/OnnxStack.StableDiffusion/Diffusers/DiffuserBase.cs:line 116
app_1  |    at OnnxStack.StableDiffusion.Services.StableDiffusionService.DiffuseAsync(IModelOptions modelOptions, PromptOptions promptOptions, SchedulerOptions schedulerOptions, Action`2 progress, CancellationToken cancellationToken) in /app/OnnxStack.StableDiffusion/Services/StableDiffusionService.cs:line 220
app_1  |    at OnnxStack.StableDiffusion.Services.StableDiffusionService.GenerateAsync(IModelOptions model, PromptOptions prompt, SchedulerOptions options, Action`2 progressCallback, CancellationToken cancellationToken) in /app/OnnxStack.StableDiffusion/Services/StableDiffusionService.cs:line 92
app_1  |   Standard Output Messages:
app_1  |  2023-11-13T12:11:15.4959035+00:00 - Information - 0 - OnnxStack.IntegrationTests.StableDiffusionTests - Attempting to load model StableDiffusion 1.5
app_1  |  2023-11-13T12:11:28.0413562+00:00 - Information - 0 - OnnxStack.StableDiffusion.Diffusers.StableDiffusion.StableDiffusionDiffuser - [DiffuseAsync] - Begin...
app_1  |  2023-11-13T12:11:28.0425114+00:00 - Information - 0 - OnnxStack.StableDiffusion.Diffusers.StableDiffusion.StableDiffusionDiffuser - [DiffuseAsync] - Model: StableDiffusion 1.5, Pipeline: StableDiffusion, Diffuser: TextToImage, Scheduler: EulerAncestral
app_1  | 
app_1  | 
app_1  | 
app_1  | Failed!  - Failed:     1, Passed:     1, Skipped:     0, Total:     2, Duration: 28 s - OnnxStack.IntegrationTests.dll (net7.0)
onnxstack_app_1 exited with code 1

Did y'all break something, or do I need to update how I'm calling OnnxStack in order to fix my test? Either way, that likely means anyone who was calling it this way and updated to a newer version would hit the same exception, right?

I can see it's complaining about the tensor shape being different ({1,4,64,64} where {2,4,64,64} was requested)...

saddam213 commented 1 year ago

Oops, my bad, it seems I broke it for SD models; I have committed a fix.

I moved the codebase over to the new OnnxRuntime OrtValue API. It was a large change, and I missed this in my testing because I used LCM, which does not do guidance, so the error didn't show up until I used the model you tried :/

Regarding OrtExtensions, I have noticed a few issues reported online about this. It does not seem to be a Windows issue, but a macOS and Linux one. One thing I do know is the app MUST be x64 or it won't work at all, as Microsoft.ML is x64 only.

This repo is new and changes rapidly, so sorry if I break your tests. I'm still trying to figure out the best way to structure this application as I find new cool things to add, so bear with me :p

saddam213 commented 1 year ago

If you publish the Linux build as self-contained it "should" run without issue
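
For example, something along these lines (the project name and runtime identifier here are just for illustration):

dotnet publish OnnxStack.IntegrationTests -c Release -r linux-x64 --self-contained true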

saddam213 commented 1 year ago

The tests look great; I have not even had a chance to add one yet, so this is awesome.

One small thing I noticed:

services.AddOnnxStack();
services.AddOnnxStackStableDiffusion();

AddOnnxStackStableDiffusion calls AddOnnxStack internally, so there's no need to call both.
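
For reference, a minimal sketch of the simplified registration (the OnnxStack extension-method names come from the snippet above; the generic-host boilerplate is illustrative and assumes Microsoft.Extensions.Hosting):

// Sketch only: AddOnnxStackStableDiffusion registers the core OnnxStack
// services itself, so the separate AddOnnxStack() call can be dropped.
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
// plus the OnnxStack using directives that expose the extension methods

var host = Host.CreateDefaultBuilder()
    .ConfigureServices(services =>
    {
        services.AddOnnxStackStableDiffusion(); // also wires up the OnnxStack.Core services
    })
    .Build();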

james-s-tayler commented 1 year ago

Awesome! Thanks, yeah, now that I've pulled the latest master the tests are indeed passing :)

I'll try to add an LCM test in as well, so all bases are covered.

saddam213 commented 1 year ago

> Awesome! Thanks, yeah, now that I've pulled the latest master the tests are indeed passing :)
>
> I'll try to add an LCM test in as well, so all bases are covered.

It might be easier to do a test with GuidanceScale set to 1f or below, as this would simulate a model with that issue.
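
Something like the following xUnit sketch could cover that case (GuidanceScale, PromptOptions, SchedulerOptions and GenerateAsImageAsync are taken from the thread and stack trace above; the prompt property, fixture fields and the exact method signature are assumptions, so treat this as an illustration rather than working code):

// Hypothetical test: GuidanceScale at or below 1 skips the guidance pass,
// exercising the same single-batch path that LCM models use.
[Fact]
public async Task GivenGuidanceDisabled_WhenInference_ThenImageGenerated()
{
    var promptOptions = new PromptOptions
    {
        Prompt = "an apple sitting on a table" // illustrative prompt text
    };

    var schedulerOptions = new SchedulerOptions
    {
        GuidanceScale = 1f // 1f or below: no classifier-free guidance
    };

    // _stableDiffusionService and _model are assumed to come from the test fixture;
    // the progress callback and cancellation token are omitted for brevity.
    var image = await _stableDiffusionService.GenerateAsImageAsync(_model, promptOptions, schedulerOptions);

    Assert.NotNull(image);
}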

james-s-tayler commented 1 year ago

Given these tests are running on the CPU execution provider they run super slowly, so my current goal is just to get the most minimal set of happy-path test cases (do the models load? can we generate an image consistently?) merged. Then I'll look at reworking the Docker containers so they can leverage the NVIDIA image that allows GPU pass-through, get the test suite running much faster, and then work through adding more comprehensive coverage.

saddam213 commented 1 year ago

That would be awesome, appreciate any tests added.

I could set up a local server here in CHCH with some GPUs if that's easier? You're in NZ, right?

james-s-tayler commented 1 year ago

Oh snap, you're in NZ too! Nice! Didn't see that. Yeah, I'm up in Auckland.

I've got a 4090, so I can run them plenty fast locally once the DevOps side of things supports it; I just need to work through that piece by piece. Ideally, the trajectory is getting the test suite running with hardware acceleration inside a CI/CD pipeline to ensure the integrity of the project as new functionality is developed. I'm also keen to make it as accessible and friendly as possible in terms of the local developer experience, to maximize ease of contribution.

saddam213 commented 1 year ago

A 4090 would be nice. I have a 3090 in my dev machine, but only a P100, a T4 and 2 M40s in my servers; even combined they are not even close to your compute power.

saddam213 commented 1 year ago

Would you like me to merge this one in now, or would you like to keep adding to it?

james-s-tayler commented 1 year ago

I'm still adding to this one. I'm just about to push up the last commit on it, since it looks like I've got the LCM tests working now too :) Once that's pushed I'll let you know and it'll be ready to merge.

saddam213 commented 1 year ago

sweet as

james-s-tayler commented 1 year ago

Done! Should be ready to merge now.

saddam213 commented 1 year ago

thanks man!!