InterDigitalInc / CompressAI

A PyTorch library and evaluation platform for end-to-end compression research
https://interdigitalinc.github.io/CompressAI/
BSD 3-Clause Clear License

The reconstructed image using codec.py is inconsistent with the original image #279

Open ZSBSB opened 7 months ago

ZSBSB commented 7 months ago

[Screenshots: 微信截图_20240326094925, 微信截图_20240326094956]

Thank you for your contribution! I compressed the original image into a bitstream on one computer and then decoded it on another computer with a different hardware configuration (CPU and GPU). The reconstructed image has many black square dots. The problem does not occur if the same computer is used for both encoding and decoding, or if the two computers have exactly the same configuration. Is this normal?

YodaEmbedding commented 7 months ago

Corrupted reconstruction of images often occurs because the decoder decodes a different latent than what was encoded. There may be other possible reasons (e.g., the input image lies far outside the distribution that the model was trained for), but let's focus on the most common one.

What often happens is that the encoder and decoder are not "in-sync". Side-information and autoregressive models are the most susceptible to this. This is because even small differences in the encoding distribution (e.g., discretized Gaussian) at the decoder-side can result in incorrectly decoded symbols. This in turn throws the context model (which relies upon previously decoded symbols to generate new encoding distributions) out-of-sync. This results in a cascading series of accumulating errors, and everything blows up quite quickly.

Using a precomputed discretized Gaussian table offers a bit of built-in protection against differences in reconstructed scales ($\sigma$). CompressAI models usually make use of this during compress/decompress. The protection for decoding a single symbol is probabilistic and depends on the density of the floating point values within the operating range (i.e., p ≈ 1e-6 for float32... though it may be better, depending on the situation...). Unfortunately, the means ($\mu$) are less well protected, since they are not quantized and may vary across different runs.
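To make the table protection a bit more concrete, here is a minimal sketch in plain Python. It assumes a log-spaced scale table; the bounds 0.11 and 256 and the 64 levels are illustrative assumptions, and `make_scale_table` / `scale_to_index` are hypothetical helpers written for this example, not CompressAI's API. Each $\sigma$ is snapped to a table index, so the encoder and decoder agree whenever their slightly different $\sigma$ values land in the same bin, and disagree only when a bin edge falls between them:

```python
import bisect
import math


def make_scale_table(lo=0.11, hi=256.0, levels=64):
    """Hypothetical helper: log-spaced table of allowed scale values."""
    step = (math.log(hi) - math.log(lo)) / (levels - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(levels)]


def scale_to_index(scale, table):
    """Hypothetical helper: index of the smallest table entry >= scale (clamped)."""
    return min(bisect.bisect_left(table, scale), len(table) - 1)


table = make_scale_table()

# Typical case: the decoder computes a sigma that differs by a tiny relative error.
sigma_enc = 1.2345678
sigma_dec = sigma_enc * (1 + 1e-7)
idx_enc = scale_to_index(sigma_enc, table)
idx_dec = scale_to_index(sigma_dec, table)
print(idx_enc == idx_dec)  # True: both values fall into the same bin

# Unlucky case: the encoder's sigma sits exactly on a bin edge.
edge_enc = table[20]
edge_dec = edge_enc * (1 + 1e-12)
edge_idx_enc = scale_to_index(edge_enc, table)
edge_idx_dec = scale_to_index(edge_dec, table)
print(edge_idx_enc == edge_idx_dec)  # False: the two sides pick different distributions
```

The second case is the rare one, which is why the table only protects probabilistically rather than absolutely.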

What causes these small differences in computed values, which then throw the decoder out-of-sync? A common reason is the use of non-deterministic operations. Even the same sum/reduction operations executed on the same hardware might give different results across different runs.

Non-deterministic reductions

Also, the current implementation may not produce the same results on different devices due to non-deterministic hardware computations. Thus, encoding on device A and decoding on device B may not work. Even the same device may fail if it doesn't do vector floating point computations deterministically. For instance, the result of a reduction sum depends on the reduction order, which may vary depending on which compute units finish first, the reduction tree enforced by a hardware unit, or other reasons.

(0.1 + 0.2) + 0.3   !=   0.1 + (0.2 + 0.3)

Evaluating each side gives different results:

>>> (0.1 + 0.2) + 0.3
0.6000000000000001

>>> 0.1 + (0.2 + 0.3)
0.6

...floating point addition is not exactly associative!

[Source]
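The same effect shows up when a reduction is split across several workers. The sketch below is only a simulation in plain Python: `sequential_sum` and `chunked_sum` are hypothetical helpers that mimic one accumulator versus per-chunk partial sums, and the input values are chosen to exaggerate the rounding. It does not model any particular GPU or library kernel:

```python
import math
from functools import reduce


def sequential_sum(xs):
    """Left-to-right accumulation, like a single compute unit."""
    return reduce(lambda a, b: a + b, xs, 0.0)


def chunked_sum(xs, chunk):
    """Sum each chunk separately, then combine the partial sums."""
    partials = [sequential_sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sequential_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))  # 1.0
print(chunked_sum(values, 2))  # 0.0
print(math.fsum(values))       # 2.0 (correctly rounded, independent of order)
```

Same inputs, same operation, three different answers, purely because of how the additions are grouped.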

Proposed remedies usually involve quantizing all the operands involved in the operations. Even then, one must be careful to ensure that properties like associativity, (a + b) + c = a + (b + c), hold, so that determinism can actually be proven. There's probably more about these remedies in recent research papers, if you're curious.
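As a rough illustration of the quantization idea: integer addition is exactly associative, so if the operands are first rounded to fixed-point integers, the sum no longer depends on the order. The `quantize` helper and the 16 fractional bits below are assumptions chosen for this sketch, not something taken from CompressAI or from a specific paper:

```python
from itertools import permutations

FRAC_BITS = 16  # assumed fixed-point precision for this sketch


def quantize(x, frac_bits=FRAC_BITS):
    """Hypothetical helper: round a float to a fixed-point integer."""
    return round(x * (1 << frac_bits))


values = [0.1, 0.2, 0.3]

# Sum the same three numbers in every possible order.
float_sums = {sum(p, 0.0) for p in permutations(values)}
int_sums = {sum(quantize(v) for v in p) for p in permutations(values)}

print(len(float_sums) > 1)  # True: the float result depends on the order
print(len(int_sums) == 1)   # True: the fixed-point result does not
```

Python integers never overflow, so the integer sums here are exact; on fixed-width hardware the accumulator would also need to be wide enough for the same guarantee to hold.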


Related issues/comments:

ZSBSB commented 7 months ago

Thank you very much for your reply, I will check the possible reasons.