Vertexwahn / rules_oidn

Bazel rules for Intel Open Image Denoise
https://vertexwahn.de/page/open_source/
Apache License 2.0
1 stars 1 forks source link

Black stripes in denoised image #1

Closed Vertexwahn closed 1 year ago

Vertexwahn commented 1 year ago

Note: There is a $500.00USD bounty on this issue

Apparently, there are black stripes in the denoised image: image

Currently, the root cause for this is unclear. I opened a question about this on StackOverflow with no success so far.

I document also some of my doings and findings in some Medium posts:

How to reproduce

git clone https://github.com/Vertexwahn/rules_oidn.git
cd rules_oidn
cd tests

Run example with Ubuntu 22.04:

bazel run --config=gcc11 //:example

Run example with Visual Studio 2022:

bazel run --config=vs2022 //:example

The example takes a noisy image and denoises it.

...
int main() {
    cout << "Simple denoising example" << endl;

    Image3f color = load_image_openexr("data/cornel_box.naive_diffuse.box_filter.spp128.embree.exr");
    ...
    Image3f out{color.width(), color.height()};

    ...

    float* colorPtr = color.data();
    ...
    float* outputPtr = out.data();
    int width = out.width();
    int height = out.height();

    oidn::DeviceRef device = oidn::newDevice();
    device.set("verbose", 1);
    device.commit();

    // Create a filter for denoising a beauty (color) image using optional auxiliary images too
    oidn::FilterRef filter = device.newFilter("RT"); // generic ray tracing filter
    filter.setImage("color",  colorPtr,  oidn::Format::Float3, width, height); // beauty
    ...
    filter.setImage("output", outputPtr, oidn::Format::Float3, width, height); // denoised beauty
    filter.set("hdr", true); // beauty image is HDR
    filter.commit();

    // Filter the image
    filter.execute();

    // Check for errors
    const char* errorMessage;
    if (device.getError(errorMessage) != oidn::Error::None) {
        std::cout << "Error: " << errorMessage << std::endl;
    }

    store_open_exr("denoised.exr", out);

    return 0;
}

Unfortunately, the denoised image contains black stripes.

I tested the same input with https://github.com/DeclanRussell/IntelOIDenoiser and got the expected result (without black stripes). Therefore, the problem must be within my bazalization of OIDN or my supporting around it.

If I choose a constant color image, e.g.

// for debug reasons the color image can be initialized with a const color
if(true) { 
    for (int x = 0; x < color.width(); ++x) {
        for (int y = 0; y < color.height(); ++y) {
            color.set_pixel(x,y,.5f, .5f, .5f);
        }
    }
}

I also get black stripes.

I also created a branch of oidn that contains a "direct" bazelization of oidn which is almost similar to rules_oidn: https://github.com/Vertexwahn/oidn/tree/add-bazel-support. In this branch I bazelized oidnTest which contains a few test cases which all pass with success. I bazelized also oidnDenoiser. You can run it via:

# the_cornell_box.pfm is in data folder in the oidn repo
bazel run --config=gcc11 //:oidnDenoise -- --hdr /home/vertexwahn/Desktop/the_cornell_box.pfm -o /home/vertexwahn/Desktop/denoised.pfm

The generated file denoised.pfm shows the same black stripes. It seems that the black stripes are always 5 pixel wide followed by 3 correct looking color stripes.

Not sure if there are any problems with handing over memory to ISPC, or doing the Filter Operation, etc. Since I tested different image formats (OpenEXR, PFM), I assume that the error is not within my way of storing images.

Expected fix

The expected fix should identify the root cause (maybe ispc_rules, some memory layout somewhere, etc.) and provide PRs to fix it. The denoised image should look identical to the CMake build. Could be that the fix will be less than 10 lines in code - maybe single line change. In the end rules_oidn should work out of the box on Windows and Ubuntut 22.04 (as it does currently - but without the black stripes) - macOS is not expected to work since this requires a different version of oneDNN which is out of the scope of this issue.

parvit commented 1 year ago

Hi, i was also able to reproduce the issue, i'm in the process of debugging. Only thing i was able to extract to this moment is that the issue is specifically in the filter execution. I'll let you know if i get more details for a possible fix.

parvit commented 1 year ago

Hi, i've extracted the image data after the execution of the first layer of the network in both cases, there seems to be a sharp difference between the outputs.

I've confirmed that the tza files used by both are the same, but could this indicate an error in the loading of the weights?

== CMAKE == cmake_odin

== BAZEL == bazel_odin

parvit commented 1 year ago

Hi, i think i'm zeroing in on the issue, i've changed both the versions of InputReorder_kernel to just set the entire tile to a solid 1.0f value, this way: input_reorder_change

Here the results:

== CMAKE == cmake

== BAZEL == bazel

So what i think now is that is some kind of memory alignment issue that writes wrongly the calculated values.

ghost commented 1 year ago

@Vertexwahn , it looks like @parvit goes the code-level way to fix this issue, which is fine.

I go the "technical project management" way trying to strip the "bazel" requirement, because of the high effort to support an alternate (to what the project uses) build-system.

You're risking to invest more and more time in tool/build maintenance. Take a look at:

parvit commented 1 year ago

@Vertexwahn @abebeos I assumed the bazel usage was the whole point of the task, but if its not required than we could make it so the cmake generation procures and builds automtically all the dependencies.

Let me know if this is a route you are interested in, in the meantime i'll continue my investigation.

ghost commented 1 year ago

I assumed the bazel usage was the whole point of the task

It is. I usually start any task with Requirements & Feasibility analysis (the result of which will most possibly be: "MUST Requirement: Bazel). So all good, I guess.

Vertexwahn commented 1 year ago

@Vertexwahn @abebeos I assumed the bazel usage was the whole point of the task,

Exactly.

@abebeos @parvit The goal of the Bounty is to use "pure" Bazel, not CMake.

ghost commented 1 year ago

@Vertexwahn , I looked at all this, did some trials with your sources, reviewed your initial PR etc.

What you need additionally is:

Let me know if you're willing to add an additional bounty (can be again $500) for those tasks.

My estimation is that within a timeframe of 2 weeks (not fulltime of course) this can be solved.

ghost commented 1 year ago

(i would open a project, file some issues that would need action from your side, estimated 1 hour effort. And now and then a question.)

ghost commented 1 year ago

@parvit still working on this?

parvit commented 1 year ago

Yes i have found some anomalies in the tensor values but nothing conclusive yet

ghost commented 1 year ago

My attempt is abstractly, without detail involvement.

How familiar are you with bazel? (My mindset is more around maven/cmake.)

Could you attempt to bypass the unpack step as described within https://github.com/Vertexwahn/rules_oidn/issues/8 ?

ghost commented 1 year ago

@Vertexwahn , you'll find my preliminary results here:

The effort to apply those findings/changes to your sources is very high for me (as I do not know neither bazel nor your mindset/code), so I would prefer if you would take this on (as you're fully familiar with what you wrote). Should be more efficient this way. You would start with this:

And go on with the others in the list. Note that those tasks need to be applied to the source-code anyways.

As for the long run:

I still believe that the best way to solve this would be a oidn fork with bazel support (one that I would keep "clean" right from the beginning, e.g. add only the minimum needed changes, always strive to use the exact same source-code that should always function with cmake). But the effort for this is beyond a $500 bounty.

parvit commented 1 year ago

My attempt is abstractly, without detail involvement.

How familiar are you with bazel? (My mindset is more around maven/cmake.)

Could you attempt to bypass the unpack step as described within https://github.com/Vertexwahn/rules_oidn/issues/8 ?

You might have missed in my previous comments that the weight files are the binary same so i don't think that helps.

ghost commented 1 year ago

the binary same

8 verifies the tiny possibility that something went wrong during extraction. Could be verified manually, too, i guess, via file comparison.

parvit commented 1 year ago

That is exactly what i did, a binary comparison of the files.

Il Mar 14 Mar 2023, 20:00 abebeos @.***> ha scritto:

the binary same

8 https://github.com/Vertexwahn/rules_oidn/issues/8 verifies the tiny

possibility that something went wrong during extraction. Could be verified manually, too, i guess, via file comparison.

— Reply to this email directly, view it on GitHub https://github.com/Vertexwahn/rules_oidn/issues/1#issuecomment-1468666554, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWGJD4EOU47DTPU543KK6CTW4C555ANCNFSM6AAAAAAVLGHDB4 . You are receiving this because you were mentioned.Message ID: @.***>

ghost commented 1 year ago

That is exactly what i did, a binary comparison of the files.

You wrote earlier:

I've confirmed that the tza files used by both are the same,

"tza" files. I'm referring to the resulting cpp/h source files

In any way, i'm awaiting @Vertexwahn here, many things need a rollback/change.

Though a full replay would be the best option.

parvit commented 1 year ago

2023-03-14 22_46_17-Window

Also those binaries are exactly the same.

parvit commented 1 year ago

Hi @Vertexwahn i have found the issue, i'll update tomorrow on the details

ghost commented 1 year ago

Also those binaries are exactly the same.

Ok, thus now we can say that

i have found the issue, i'll update tomorrow on the details

@Vertexwahn, independent of this, we should do a replay, thus you can avoid similar incidents in future

ghost commented 1 year ago

The expected fix should identify the root cause

Root cause identified as "wrong source-code version used for mkl_dnn"

My cross-tests show that those versions are not compatible.

Changing your bazel build-scripts is not trivial, though, due to the used build files from "TensorFlow".

You'll find more details on how to proceed within.

parvit commented 1 year ago

My fix is about the missed definition in ipcs_rules of the OIDN_DNNL define, so no version change necessary.

@Vertexwahn can you indicate which fix would you prefer as i have to prepare the fixes in the bazel projects, if you don't think you'll accept my solution than i can stop.

Regards,

Il Mer 15 Mar 2023, 11:34 abebeos @.***> ha scritto:

The expected fix should identify the root cause

Root cause identified as "wrong source-code version used for mkl_dnn"

My cross-tests show that those versions are not compatible.

Changing your bazel build-scripts is not trivial, though, due to the used build files from "TensorFlow".

You'll find more details on how to proceed within.

— Reply to this email directly, view it on GitHub https://github.com/Vertexwahn/rules_oidn/issues/1#issuecomment-1469756056, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWGJD4ESTQ52UZNMYCCDJGDW4GLNDANCNFSM6AAAAAAVLGHDB4 . You are receiving this because you were mentioned.Message ID: @.***>

ghost commented 1 year ago

My fix is about the missed definition in ipcs_rules

if you've validated/verified your fix, the I'd just go on ('cause Vertexwahn seems to be busy with other work, no replies since days).

The "version changes" and the overall build-system replay/rewrite is anyways a necessity (= MUST requirement) to achieve production-grade bazel support. This would be independent of some direct fix. That's why I asked initially for an additional bounty within https://github.com/Vertexwahn/rules_oidn/issues/1#issuecomment-1464976927.

ghost commented 1 year ago

@Vertexwahn , #7 further confirmed (oidn uses a customized 2.4.2 version of mkl-dnn, with several differences in source-code, compared to the oneapi version 2.4.2)

Vertexwahn commented 1 year ago

@parvit

My fix is about the missed definition in ipcs_rules of the OIDN_DNNL define, so no version change necessary.

Do mean rules_oidn or rules_ispc?

I switched to oneDNN-2.7.3 in the meanwhile - but I am also fine with downgrading again.

I am curious what you mean by "missed definition in ipcs_rules of the OIDN_DNNL define" - I have no idea, to be honest - looking forward to a PR!

@Vertexwahn can you indicate which fix would you prefer as i have to prepare the fixes in the bazel projects, if you don't think you'll accept my solution than i can stop.

As stated in "Expected fix" I will accept anything that fulfills this:

The expected fix should identify the root cause (maybe ispc_rules, some memory layout somewhere, etc.) and provide PRs to fix it. The denoised image should look identical to the CMake build. Could be that the fix will be less than 10 lines in code - maybe single line change. In the end rules_oidn should work out of the box on Windows and Ubuntut 22.04 (as it does currently - but without the black stripes) - macOS is not expected to work since this requires a different version of oneDNN which is out of the scope of this issue.

As stated previously the solution should be a pure Bazel solution without the use of CMake.

The use of rules_foreign_cc should be avoided. It is expected that oneDNN is used. If the fix requires the change of other repos (e.g. rules_ispc) you should also provides those changes - either as Pull Request or patch files.

The expected fix should identify the root cause Root cause identified as "wrong source-code version used for mkl_dnn" My cross-tests show that those versions are not compatible. Changing your bazel build-scripts is not trivial, though, due to the used build files from "TensorFlow". You'll find more details on how to proceed within. - #7 <#7> — Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWGJD4ESTQ52UZNMYCCDJGDW4GLNDANCNFSM6AAAAAAVLGHDB4 . You are receiving this because you were mentioned.Message ID: @.***>

You can ignore what @abebeos is saying - since I follow here the policy "He who pays the piper calls the tune". Changing bazel build scripts is maybe for @abebeos difficult, but not for me ;)

parvit commented 1 year ago

@Vertexwahn Thanks for the clarification i feel a lot more confident now about the state of the issue.

The fix on rules_ispc (sorry typo i was on mobile) is going to be pure Bazel, i'm in fact in the process of preparing the PRs on the relevant repos, unfortunately i have limited time slots to work on it so bear with me but it should be ready shortly :D

It will augment the calls to ispc_cc_library to pass a set of defines (configurable of course) which are missing (config.h is not included in the source headers so they are passed on the command line of ispc) and without those the wrong index calculations happen in the tensor.isph which cause the issue.

Fixing that then produces the same output on both builds so i'm confident you'll be happy with the result, i'll update as soon as i'm ready.

Regards,

Vertexwahn commented 1 year ago

@parvit Sounds awesome!

ghost commented 1 year ago

You can ignore what @abebeos is saying - since I follow here the policy "He who pays the piper calls the tune".

No need for such statements, my comments were always either neutral or encouraging.

Changing bazel build scripts is maybe for @abebeos difficult, but not for me ;)

It's not about difficulty, but about reducing effort. Producing a simple to understand and to maintain result is the difficult (and time consuming) task.

So, even if @parvit fixes the bug, the current general status of your bazel build is not of production quality.

So, my solution path targets quality, and fixes the bug as a side effect:

ghost commented 1 year ago

The use of rules_foreign_cc should be avoided.

This is an important construct for a lower risk creation of bazel builds.

ghost commented 1 year ago

It is expected that oneDNN is used.

It is standard-procedure to use the exact versions of the original system.

Here: Bazel build must not introduce different behavior by using different versions

parvit commented 1 year ago

@Vertexwahn The three PRs that comprise my fix to this issue are attached.

My test steps (on linux):

  1. remove the entire .cache/bazel folder
  2. clone the repo https://github.com/parvit/rules_oidn and switch to the fix-black-stripes branch
  3. enter the rules_oidn/tests folder
  4. execute bazel run --config=gcc11 //:example
  5. After the run, the file denoised.exr can be found under .cache/bazel/_bazel_root/<build-hash>/execroot/__main__/bazel-out/k8-fastbuild/bin/example.runfiles/__main__/denoised.exr

Hopefully this is all clear and reproducibile without much effort. Regards,

Vertexwahn commented 1 year ago

Resolved by @parvit - Thanks for your help!

ghost commented 1 year ago

@parvit , nicely done.

@Vertexwahn , I do not understand why you refuse all my suggestions re enhancing the overall bazel build quality, it was stated clearly by yourself that you wish so:

within https://medium.com/@Vertexwahn/bazelizing-open-image-denoise-part-1-the-journey-begins-a16c78ea1b88 you wrote:

I personally would prefer to add Bazel support directly to the official main branch, to be honest.

This is of course a highly rational and relevant goal (bazel support right within https://github.com/OpenImageDenoise/oidn)

This is why I suggested early on to do this work in parallel to looking for a bug-fix:

Though I was initially fixated on #7, the bug would have been fixed by applying #10.

nearly all steps mentioned within #5 are necessary to achieve a oidn bazel build that is

Let me know if this would be still of interest for you.

=> A oidn bazel-build that is compatible to the original cmake build