AcademySoftwareFoundation / OpenColorIO

A color management framework for visual effects and animation.
https://opencolorio.org

GPU 3D lut transform causes banding #1763

Open cessen opened 1 year ago

cessen commented 1 year ago

I'm using 3D LUTs that span a large dynamic range and need trilinear interpolation for correct results (I'm happy to provide details of my use case if that's helpful).

Due to the nature of the LUTs, they should provide sufficiently accurate results even at relatively low LUT resolutions. However, much to my surprise, there is significant banding when they're applied by OCIO on the GPU:

[Image: GPU result, showing posterization/banding]

The only way to mitigate this is to crank the LUT resolution extremely high, and even then dark colors in particular still exhibit some subtle banding. The CPU results, on the other hand, are correct even with the low-resolution LUT:

[Image: CPU result, correct]

Looking into OCIO's source, this seems to be the culprit:

https://github.com/AcademySoftwareFoundation/OpenColorIO/blob/b00877959f4497f9b8e5a08859d0ef8059fdd8e5/src/OpenColorIO/ops/lut3d/Lut3DOpGPU.cpp#L221-L236

The comment there even notes that this can be an issue on some GPUs. For the record, I'm using an NVIDIA RTX 3060, so this isn't exactly niche hardware.

I think it would make sense to replace the sampleTex3D() call with hand-written trilinear interpolation code, to ensure that it executes at full floating-point precision and gives correct results on all platforms. It might be a small performance hit, but for something like OCIO I suspect correctness and consistency are more important.
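To illustrate what I mean, here's a rough sketch of the interpolation math such a shader would perform, written as plain C++ rather than actual shader code (the LUT layout and edge clamping here are simplifying assumptions for the sketch, not OCIO's real implementation):

```cpp
#include <array>
#include <cmath>
#include <vector>

using Rgb = std::array<float, 3>;

// Linear interpolation between two RGB samples.
static Rgb lerpRgb(const Rgb & a, const Rgb & b, float t)
{
    return { a[0] + (b[0] - a[0]) * t,
             a[1] + (b[1] - a[1]) * t,
             a[2] + (b[2] - a[2]) * t };
}

// Fetch one LUT entry, clamping indices to the table edges.
// Assumes the table is stored as lut[(b * size + g) * size + r].
static Rgb fetch(const std::vector<Rgb> & lut, int size, int r, int g, int b)
{
    auto clampIdx = [size](int i) { return i < 0 ? 0 : (i > size - 1 ? size - 1 : i); };
    return lut[(clampIdx(b) * size + clampIdx(g)) * size + clampIdx(r)];
}

// Trilinear sample at normalized coordinates in [0, 1].  The key point is
// that the fractional weights (fx, fy, fz) are computed and applied in full
// 32-bit float precision, rather than being quantized to the low-precision
// fixed-point fraction used by the texture filtering hardware on many GPUs.
static Rgb sampleTrilinear(const std::vector<Rgb> & lut, int size,
                           float r, float g, float b)
{
    const float scale = static_cast<float>(size - 1);
    const float x = r * scale, y = g * scale, z = b * scale;
    const int ix = static_cast<int>(std::floor(x));
    const int iy = static_cast<int>(std::floor(y));
    const int iz = static_cast<int>(std::floor(z));
    const float fx = x - ix, fy = y - iy, fz = z - iz;

    // Interpolate along R, then G, then B.
    const Rgb c00 = lerpRgb(fetch(lut, size, ix, iy,     iz    ), fetch(lut, size, ix + 1, iy,     iz    ), fx);
    const Rgb c10 = lerpRgb(fetch(lut, size, ix, iy + 1, iz    ), fetch(lut, size, ix + 1, iy + 1, iz    ), fx);
    const Rgb c01 = lerpRgb(fetch(lut, size, ix, iy,     iz + 1), fetch(lut, size, ix + 1, iy,     iz + 1), fx);
    const Rgb c11 = lerpRgb(fetch(lut, size, ix, iy + 1, iz + 1), fetch(lut, size, ix + 1, iy + 1, iz + 1), fx);
    return lerpRgb(lerpRgb(c00, c10, fy), lerpRgb(c01, c11, fy), fz);
}
```

In an actual shader this would amount to eight texel fetches plus mix() calls, with the fractional weights computed in the shader itself instead of by the texture unit.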

Would a PR along those lines be accepted?

cessen commented 1 year ago

Any thoughts/feedback about this? This is a significant problem for the work that I'm doing with OCIO, and I'd like to see it fixed.

I'm happy to do the leg work, but just want to double-check that the proposed solution is acceptable first.

doug-walker commented 1 year ago

Hi @cessen, apologies for the delayed reply. Yes, that's a pretty shocking amount of banding in the first image. It looks like maybe you're applying a 3d-LUT to a linear color space encoding? If so, the banding is probably only one of several problems you will encounter. The best practice is to convert to a log-encoded (or at least gamma-encoded) color space before applying a 3d-LUT.

In practice, OCIO users tend to use 3d-LUTs that are a minimum of 16x16x16 (and typically larger) and that are fed log-encoded data, so the quantization of the fractional component has not been an issue.

I agree that your proposal of writing a trilinear interpolation shader would work around the problem. (This is what we've done for tetrahedral interpolation already.) However, it seems that there is a risk of introducing performance regressions by not using the built-in GPU trilinear interpolation. Tests would ideally need to be done on a variety of graphics cards (old and new) and a range of 3d-LUT sizes to validate that performance was not impacted.

While we thank you for volunteering, we won't agree in advance to accepting a PR for this. We would need to see experimental results covering a sufficiently broad set of use-cases. Almost certainly there will be some speed penalty, and at that point it will be a judgment call about whether the improvement in quality for this edge case warrants the overall performance reduction. This is something that would likely be decided by the project's steering committee.

cessen commented 1 year ago

@doug-walker Thanks for the thorough reply!

> While we thank you for volunteering, we won't agree in advance to accepting a PR for this.

Yes, of course. I just wanted to confirm that it wasn't out of the question for some reason. I manage some open source projects myself, and have received PRs that I had to reject due to the scope and goals of the project, and I often wish people would ask ahead of time before putting in the work.

(Edit: I also meant to acknowledge that my phrasing made my intent there very unclear. Sorry about that.)

> However, it seems that there is a risk of introducing performance regressions by not using the built-in GPU trilinear interpolation. Tests would ideally need to be done on a variety of graphics cards (old and new) and a range of 3d-LUT sizes to validate that performance was not impacted.

Yeah, I suspected that might be the case.

It occurs to me that an alternative might be to introduce another interpolation mode, something like INTERP_LINEAR_PRECISE, for use cases where guaranteed full precision is needed.
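From a config author's perspective it would be selected per transform, just like tetrahedral interpolation is today. A minimal sketch (INTERP_LINEAR_PRECISE is hypothetical and the file name is a placeholder; the rest is the existing API):

```cpp
#include <OpenColorIO/OpenColorIO.h>
namespace OCIO = OCIO_NAMESPACE;

// Hypothetical: choosing the proposed full-precision mode per LUT, exactly
// the way INTERP_TETRAHEDRAL is chosen today.
OCIO::FileTransformRcPtr MakeHdriLut()
{
    OCIO::FileTransformRcPtr lut = OCIO::FileTransform::Create();
    lut->setSrc("hdri_look.cube");  // placeholder file name
    lut->setInterpolation(OCIO::INTERP_LINEAR);  // today: may hit low-precision GPU filtering
    // lut->setInterpolation(OCIO::INTERP_LINEAR_PRECISE);  // proposed: guaranteed full precision
    return lut;
}
```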

> It looks like maybe you're applying a 3d-LUT to a linear color space encoding?

Although there are certainly cases where we would like to apply smaller LUTs in a linear color space, we've been able to work around that by using higher-resolution LUTs and applying them in non-linear color space encodings, as you mentioned. I don't want to completely downplay this, however, because it does mean our LUTs are notably higher-resolution than the actual transforms require (which is annoying, since the data size grows with the cube of the resolution), and it forces those transforms to include otherwise unnecessary log encoding/decoding steps (which aren't free either, performance-wise). So this is certainly one of the use cases we'd like full-precision interpolation for, but it's not the main reason I opened this issue.

An example of the kind of use case where this is more of a problem is working with HDRIs. Unlike footage from principal photography, HDRIs can have a (potentially huge) dynamic range that isn't bounded in advance. So to be conservative and avoid accidental clipping, a 3D LUT intended for HDRIs needs to have a huge extent. Even using a nonlinear color space encoding, LUTs with such extents can still exhibit banding with the current GPU interpolator.

To further exacerbate things, a standard RGB LUT of such large extent doesn't have enough chroma resolution at the low end to do meaningful color transformations. We've solved this by constructing our LUTs to be applied in OCIO's variant of HSV space, so that the LUT has the same chroma resolution at all luminance levels. That's also why tetrahedral interpolation doesn't solve the issue for us: tetrahedral isn't appropriate for HSV-like spaces and produces strange results. Additionally, you only want the V channel to be log encoded in that case, but the various log transforms in OCIO don't support selective application to single channels. So we're stuck with ExponentTransform and friends for our non-linear encoding, which exacerbates the problem even further.
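For illustration only, the rough shape of that chain in OCIO's C++ API is something like the following (the exponent value and file name are placeholders, not our actual config):

```cpp
#include <OpenColorIO/OpenColorIO.h>
namespace OCIO = OCIO_NAMESPACE;

// Illustrative sketch: an HSV-domain 3D LUT wrapped in a gamma encoding,
// since there is no way to log-encode only the V channel.
OCIO::GroupTransformRcPtr MakeHsvDomainLut()
{
    OCIO::GroupTransformRcPtr group = OCIO::GroupTransform::Create();

    // RGB -> OCIO's HSV-like space.
    group->appendTransform(
        OCIO::FixedFunctionTransform::Create(OCIO::FIXED_FUNCTION_RGB_TO_HSV));

    // Gamma-encode.  This necessarily hits all channels, not just V.
    OCIO::ExponentTransformRcPtr gamma = OCIO::ExponentTransform::Create();
    const double exponent[4] = { 2.2, 2.2, 2.2, 1.0 };  // placeholder value
    gamma->setValue(exponent);
    gamma->setDirection(OCIO::TRANSFORM_DIR_INVERSE);   // encode, i.e. x^(1/2.2)
    group->appendTransform(gamma);

    // The large-extent 3D LUT itself, trilinearly interpolated (tetrahedral
    // interpolation misbehaves in an HSV-like space).
    OCIO::FileTransformRcPtr lut = OCIO::FileTransform::Create();
    lut->setSrc("hdri_grade.cube");  // placeholder file name
    lut->setInterpolation(OCIO::INTERP_LINEAR);
    group->appendTransform(lut);

    return group;
}
```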

We could work around this by e.g. applying the color transform ahead of time via the CPU path and saving to an EXR. But we have other use cases with similar issues, where that kind of thing is harder/more obnoxious to do.

So that's the practical thing we're running into.

From a user-expectations standpoint, it also just feels strange that the CPU and GPU paths can diverge in their results by that much. In something like OCIO I would expect the GPU path to be faster, but not lower quality.

doug-walker commented 1 year ago

Thanks for elaborating on the use-case.

I don't think adding INTERP_LINEAR_PRECISE is the best solution. It seems more like an optimization setting than an interpolation method. Take a look at the OptimizationFlags in OpenColorTypes.h. What do you think about the approach of adding a new setting OPTIMIZATION_NATIVE_GPU_TRILINEAR to turn this on and off? It should be off for OPTIMIZATION_LOSSLESS but on for OPTIMIZATION_VERY_GOOD or higher. It would be similar to OPTIMIZATION_FAST_LOG_EXP_POW in the sense that it controls the evaluation method rather than the steps used to reduce the number of ops.

That way, by default people will continue to get the current behavior but could opt in to your shader if desired. I think that would lower the amount of performance testing required, since it would not be used by default. However, if we could get enough performance testing done, it could potentially be deployed beyond the LOSSLESS setting.
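For reference, here's roughly where the flag would take effect on the application side (OPTIMIZATION_NATIVE_GPU_TRILINEAR is the proposed new bit and the color space names are placeholders; the rest is the existing OCIO 2.x API):

```cpp
#include <OpenColorIO/OpenColorIO.h>
namespace OCIO = OCIO_NAMESPACE;

// Sketch of the existing application-side shader-building path.  The proposed
// OPTIMIZATION_NATIVE_GPU_TRILINEAR bit would be set in OPTIMIZATION_VERY_GOOD
// and above but cleared in OPTIMIZATION_LOSSLESS, so choosing LOSSLESS here
// would opt in to the hand-written trilinear shader.
void BuildDisplayShader(const OCIO::ConstConfigRcPtr & config)
{
    // Placeholder color space names.
    OCIO::ConstProcessorRcPtr processor =
        config->getProcessor("lin_srgb", "sRGB - Display");

    // OPTIMIZATION_VERY_GOOD (and the default) would keep the current
    // native texture filtering; OPTIMIZATION_LOSSLESS would not.
    OCIO::ConstGPUProcessorRcPtr gpu =
        processor->getOptimizedGPUProcessor(OCIO::OPTIMIZATION_LOSSLESS);

    OCIO::GpuShaderDescRcPtr shaderDesc = OCIO::GpuShaderDesc::CreateShaderDesc();
    shaderDesc->setLanguage(OCIO::GPU_LANGUAGE_GLSL_4_0);
    gpu->extractGpuShaderInfo(shaderDesc);

    // shaderDesc->getShaderText() now contains the generated GLSL, which the
    // host application compiles into its own shader.
}
```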

cessen commented 1 year ago

> What do you think about the approach of adding a new setting OPTIMIZATION_NATIVE_GPU_TRILINEAR to turn this on and off? It should be off for OPTIMIZATION_LOSSLESS but on for OPTIMIZATION_VERY_GOOD or higher.

I think that's an excellent way forward. That will also make it a little easier to do the performance testing, and for other people to join in if desired, since it won't need a custom branch.

I'll work on putting together a PR for that the next chance I get.

Thanks so much!

cessen commented 1 year ago

I've started a WIP PR in #1794. It's functional, but I have some questions about how best to proceed in terms of API impact. Despite several different attempts, I couldn't find a way to get the optimization flags passed where they needed to be without affecting public APIs, which I'd like to avoid.

cessen commented 11 months ago

This is still an outstanding issue for the work that I'm doing. The PR I previously submitted immediately stalled (I had open questions but failed to make it clear that I was looking for feedback, and then got busy with other things).

In any case, it's perhaps for the best that it stalled, because I'm rethinking whether the optimization-level approach is really best here, and I'm again thinking that adding something like an INTERP_LINEAR_PRECISE interpolation mode might be a more appropriate solution.

I'm returning to that because, as far as I can tell, the optimization flags are largely under the control of the software that's integrating OCIO, and there's no way to control them from an OCIO config. Is that right? If so, that means any configuration that depends on full-precision LUT evaluation in order to function properly won't be portable across different software applications. And this is certainly relevant for our use case: we use our OCIO configurations across multiple pieces of software, and we don't have control over how they integrate OCIO.

Additionally, if the current reduced-precision LUT behavior is considered an acceptable optimization, it also seems reasonable for a configuration to use that optimization where the reduced precision is sufficient, and only selectively use the slower full-precision LUT evaluation where it's actually needed.

And as a side benefit, simply adding a new interpolation mode looks to be less invasive than threading the optimization flag through all the relevant APIs, which the optimization-level approach would require.

Thoughts?