Closed Vertexwahn closed 1 year ago
Hi, i was also able to reproduce the issue, i'm in the process of debugging. Only thing i was able to extract to this moment is that the issue is specifically in the filter execution. I'll let you know if i get more details for a possible fix.
Hi, i've extracted the image data after the execution of the first layer of the network in both cases, there seems to be a sharp difference between the outputs.
I've confirmed that the tza files used by both are the same, but could this indicate an error in the loading of the weights?
== CMAKE ==
== BAZEL ==
Hi, i think i'm zeroing in on the issue, i've changed both the versions of InputReorder_kernel to just set the entire tile to a solid 1.0f value, this way:
Here the results:
== CMAKE ==
== BAZEL ==
So what i think now is that is some kind of memory alignment issue that writes wrongly the calculated values.
@Vertexwahn , it looks like @parvit goes the code-level way to fix this issue, which is fine.
I go the "technical project management" way trying to strip the "bazel" requirement, because of the high effort to support an alternate (to what the project uses) build-system.
You're risking to invest more and more time in tool/build maintenance. Take a look at:
@Vertexwahn @abebeos I assumed the bazel usage was the whole point of the task, but if its not required than we could make it so the cmake generation procures and builds automtically all the dependencies.
Let me know if this is a route you are interested in, in the meantime i'll continue my investigation.
I assumed the bazel usage was the whole point of the task
It is. I usually start any task with Requirements & Feasibility analysis (the result of which will most possibly be: "MUST Requirement: Bazel). So all good, I guess.
@Vertexwahn @abebeos I assumed the bazel usage was the whole point of the task,
Exactly.
@abebeos @parvit The goal of the Bounty is to use "pure" Bazel, not CMake.
@Vertexwahn , I looked at all this, did some trials with your sources, reviewed your initial PR etc.
What you need additionally is:
Let me know if you're willing to add an additional bounty (can be again $500) for those tasks.
My estimation is that within a timeframe of 2 weeks (not fulltime of course) this can be solved.
(i would open a project, file some issues that would need action from your side, estimated 1 hour effort. And now and then a question.)
@parvit still working on this?
Yes i have found some anomalies in the tensor values but nothing conclusive yet
My attempt is abstractly, without detail involvement.
How familiar are you with bazel? (My mindset is more around maven/cmake.)
Could you attempt to bypass the unpack step as described within https://github.com/Vertexwahn/rules_oidn/issues/8 ?
@Vertexwahn , you'll find my preliminary results here:
The effort to apply those findings/changes to your sources is very high for me (as I do not know neither bazel nor your mindset/code), so I would prefer if you would take this on (as you're fully familiar with what you wrote). Should be more efficient this way. You would start with this:
And go on with the others in the list. Note that those tasks need to be applied to the source-code anyways.
As for the long run:
I still believe that the best way to solve this would be a oidn fork with bazel support (one that I would keep "clean" right from the beginning, e.g. add only the minimum needed changes, always strive to use the exact same source-code that should always function with cmake). But the effort for this is beyond a $500 bounty.
My attempt is abstractly, without detail involvement.
How familiar are you with bazel? (My mindset is more around maven/cmake.)
Could you attempt to bypass the unpack step as described within https://github.com/Vertexwahn/rules_oidn/issues/8 ?
You might have missed in my previous comments that the weight files are the binary same so i don't think that helps.
the binary same
That is exactly what i did, a binary comparison of the files.
Il Mar 14 Mar 2023, 20:00 abebeos @.***> ha scritto:
the binary same
8 https://github.com/Vertexwahn/rules_oidn/issues/8 verifies the tiny
possibility that something went wrong during extraction. Could be verified manually, too, i guess, via file comparison.
— Reply to this email directly, view it on GitHub https://github.com/Vertexwahn/rules_oidn/issues/1#issuecomment-1468666554, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWGJD4EOU47DTPU543KK6CTW4C555ANCNFSM6AAAAAAVLGHDB4 . You are receiving this because you were mentioned.Message ID: @.***>
That is exactly what i did, a binary comparison of the files.
You wrote earlier:
I've confirmed that the tza files used by both are the same,
"tza" files. I'm referring to the resulting cpp/h source files
In any way, i'm awaiting @Vertexwahn here, many things need a rollback/change.
Though a full replay would be the best option.
Also those binaries are exactly the same.
Hi @Vertexwahn i have found the issue, i'll update tomorrow on the details
Also those binaries are exactly the same.
Ok, thus now we can say that
i have found the issue, i'll update tomorrow on the details
@Vertexwahn, independent of this, we should do a replay, thus you can avoid similar incidents in future
The expected fix should identify the root cause
Root cause identified as "wrong source-code version used for mkl_dnn"
My cross-tests show that those versions are not compatible.
Changing your bazel build-scripts is not trivial, though, due to the used build files from "TensorFlow".
You'll find more details on how to proceed within.
My fix is about the missed definition in ipcs_rules of the OIDN_DNNL define, so no version change necessary.
@Vertexwahn can you indicate which fix would you prefer as i have to prepare the fixes in the bazel projects, if you don't think you'll accept my solution than i can stop.
Regards,
Il Mer 15 Mar 2023, 11:34 abebeos @.***> ha scritto:
The expected fix should identify the root cause
Root cause identified as "wrong source-code version used for mkl_dnn"
My cross-tests show that those versions are not compatible.
Changing your bazel build-scripts is not trivial, though, due to the used build files from "TensorFlow".
You'll find more details on how to proceed within.
— Reply to this email directly, view it on GitHub https://github.com/Vertexwahn/rules_oidn/issues/1#issuecomment-1469756056, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWGJD4ESTQ52UZNMYCCDJGDW4GLNDANCNFSM6AAAAAAVLGHDB4 . You are receiving this because you were mentioned.Message ID: @.***>
My fix is about the missed definition in ipcs_rules
if you've validated/verified your fix, the I'd just go on ('cause Vertexwahn seems to be busy with other work, no replies since days).
The "version changes" and the overall build-system replay/rewrite is anyways a necessity (= MUST requirement) to achieve production-grade bazel support. This would be independent of some direct fix. That's why I asked initially for an additional bounty within https://github.com/Vertexwahn/rules_oidn/issues/1#issuecomment-1464976927.
@Vertexwahn , #7 further confirmed (oidn uses a customized 2.4.2 version of mkl-dnn, with several differences in source-code, compared to the oneapi version 2.4.2)
@parvit
My fix is about the missed definition in ipcs_rules of the OIDN_DNNL define, so no version change necessary.
Do mean rules_oidn or rules_ispc?
I switched to oneDNN-2.7.3 in the meanwhile - but I am also fine with downgrading again.
I am curious what you mean by "missed definition in ipcs_rules of the OIDN_DNNL define" - I have no idea, to be honest - looking forward to a PR!
@Vertexwahn can you indicate which fix would you prefer as i have to prepare the fixes in the bazel projects, if you don't think you'll accept my solution than i can stop.
As stated in "Expected fix" I will accept anything that fulfills this:
The expected fix should identify the root cause (maybe ispc_rules, some memory layout somewhere, etc.) and provide PRs to fix it. The denoised image should look identical to the CMake build. Could be that the fix will be less than 10 lines in code - maybe single line change. In the end rules_oidn should work out of the box on Windows and Ubuntut 22.04 (as it does currently - but without the black stripes) - macOS is not expected to work since this requires a different version of oneDNN which is out of the scope of this issue.
As stated previously the solution should be a pure Bazel solution without the use of CMake.
The use of rules_foreign_cc should be avoided. It is expected that oneDNN is used. If the fix requires the change of other repos (e.g. rules_ispc) you should also provides those changes - either as Pull Request or patch files.
The expected fix should identify the root cause Root cause identified as "wrong source-code version used for mkl_dnn" My cross-tests show that those versions are not compatible. Changing your bazel build-scripts is not trivial, though, due to the used build files from "TensorFlow". You'll find more details on how to proceed within. - #7 <#7> — Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWGJD4ESTQ52UZNMYCCDJGDW4GLNDANCNFSM6AAAAAAVLGHDB4 . You are receiving this because you were mentioned.Message ID: @.***>
You can ignore what @abebeos is saying - since I follow here the policy "He who pays the piper calls the tune". Changing bazel build scripts is maybe for @abebeos difficult, but not for me ;)
@Vertexwahn Thanks for the clarification i feel a lot more confident now about the state of the issue.
The fix on rules_ispc (sorry typo i was on mobile) is going to be pure Bazel, i'm in fact in the process of preparing the PRs on the relevant repos, unfortunately i have limited time slots to work on it so bear with me but it should be ready shortly :D
It will augment the calls to ispc_cc_library
to pass a set of defines (configurable of course) which are missing (config.h is not included in the source headers so they are passed on the command line of ispc) and without those the wrong index calculations happen in the tensor.isph
which cause the issue.
Fixing that then produces the same output on both builds so i'm confident you'll be happy with the result, i'll update as soon as i'm ready.
Regards,
@parvit Sounds awesome!
You can ignore what @abebeos is saying - since I follow here the policy "He who pays the piper calls the tune".
No need for such statements, my comments were always either neutral or encouraging.
Changing bazel build scripts is maybe for @abebeos difficult, but not for me ;)
It's not about difficulty, but about reducing effort. Producing a simple to understand and to maintain result is the difficult (and time consuming) task.
So, even if @parvit fixes the bug, the current general status of your bazel build is not of production quality.
So, my solution path targets quality, and fixes the bug as a side effect:
The use of rules_foreign_cc should be avoided.
This is an important construct for a lower risk creation of bazel builds.
It is expected that oneDNN is used.
It is standard-procedure to use the exact versions of the original system.
Here: Bazel build must not introduce different behavior by using different versions
@Vertexwahn The three PRs that comprise my fix to this issue are attached.
My test steps (on linux):
https://github.com/parvit/rules_oidn
and switch to the fix-black-stripes
branchbazel run --config=gcc11 //:example
denoised.exr
can be found under .cache/bazel/_bazel_root/<build-hash>/execroot/__main__/bazel-out/k8-fastbuild/bin/example.runfiles/__main__/denoised.exr
Hopefully this is all clear and reproducibile without much effort. Regards,
Resolved by @parvit - Thanks for your help!
@parvit , nicely done.
@Vertexwahn , I do not understand why you refuse all my suggestions re enhancing the overall bazel build quality, it was stated clearly by yourself that you wish so:
within https://medium.com/@Vertexwahn/bazelizing-open-image-denoise-part-1-the-journey-begins-a16c78ea1b88 you wrote:
I personally would prefer to add Bazel support directly to the official main branch, to be honest.
This is of course a highly rational and relevant goal (bazel support right within https://github.com/OpenImageDenoise/oidn)
This is why I suggested early on to do this work in parallel to looking for a bug-fix:
Though I was initially fixated on #7, the bug would have been fixed by applying #10.
nearly all steps mentioned within #5 are necessary to achieve a oidn bazel build that is
Let me know if this would be still of interest for you.
=> A oidn bazel-build that is compatible to the original cmake build
Note: There is a $500.00USD bounty on this issue
Apparently, there are black stripes in the denoised image:
Currently, the root cause for this is unclear. I opened a question about this on StackOverflow with no success so far.
I document also some of my doings and findings in some Medium posts:
How to reproduce
Run example with Ubuntu 22.04:
Run example with Visual Studio 2022:
The example takes a noisy image and denoises it.
Unfortunately, the denoised image contains black stripes.
I tested the same input with https://github.com/DeclanRussell/IntelOIDenoiser and got the expected result (without black stripes). Therefore, the problem must be within my bazalization of OIDN or my supporting around it.
If I choose a constant color image, e.g.
I also get black stripes.
I also created a branch of oidn that contains a "direct" bazelization of oidn which is almost similar to rules_oidn: https://github.com/Vertexwahn/oidn/tree/add-bazel-support. In this branch I bazelized
oidnTest
which contains a few test cases which all pass with success. I bazelized alsooidnDenoiser
. You can run it via:The generated file
denoised.pfm
shows the same black stripes. It seems that the black stripes are always 5 pixel wide followed by 3 correct looking color stripes.Not sure if there are any problems with handing over memory to ISPC, or doing the Filter Operation, etc. Since I tested different image formats (OpenEXR, PFM), I assume that the error is not within my way of storing images.
Expected fix
The expected fix should identify the root cause (maybe ispc_rules, some memory layout somewhere, etc.) and provide PRs to fix it. The denoised image should look identical to the CMake build. Could be that the fix will be less than 10 lines in code - maybe single line change. In the end rules_oidn should work out of the box on Windows and Ubuntut 22.04 (as it does currently - but without the black stripes) - macOS is not expected to work since this requires a different version of oneDNN which is out of the scope of this issue.