InterDigitalInc / CompressAI

A PyTorch library and evaluation platform for end-to-end compression research
https://interdigitalinc.github.io/CompressAI/
BSD 3-Clause Clear License

Floating point exception during model.update() #63

Closed cvignac closed 3 years ago

cvignac commented 3 years ago

Bug

Hello, I get a floating point exception when trying to update a CompressionModel. There is no stack trace in the error message, so I guess it comes from an internal C module.

Using print statements, I traced the problem to:

model.update() -> self._pmf_to_cdf(pmf, tail_mass, pmf_length, max_length) -> _cdf = pmf_to_quantized_cdf(prob, self.entropy_coder_precision)

and found that it is raised when the function is called on a tensor p where all entries but one are zero.

To Reproduce

Steps to reproduce the behavior:

Call

prob = torch.cat((p[: pmf_length[i]], tail_mass[i]), dim=0)
 _cdf = pmf_to_quantized_cdf(prob, self.entropy_coder_precision)

on the following tensor:

tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.6690e-33, 6.5444e-11, 1.0000e+00, 1.2011e-13, 1.0696e-36, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00], device='cuda:0', grad_fn=)
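
For reference, my rough understanding of the pmf-to-quantized-CDF step (just a sketch, not the actual C++ implementation): the pmf is rescaled by its total mass before being turned into an integer CDF, so it needs strictly positive mass to produce a valid result.

import torch

def pmf_to_quantized_cdf_sketch(pmf: torch.Tensor, precision: int = 16) -> torch.Tensor:
    # Illustration only: quantize a 1-D pmf into an integer CDF with `precision` bits.
    total = pmf.sum()
    if total <= 0:
        # A pmf with zero total mass would otherwise cause a division by zero;
        # the C++ routine presumably does this division in integer arithmetic,
        # which would explain a SIGFPE instead of an inf/NaN result.
        raise ValueError("pmf has no probability mass")
    scaled = torch.round(pmf / total * (1 << precision)).to(torch.int64)
    cdf = torch.zeros(pmf.numel() + 1, dtype=torch.int64)
    cdf[1:] = torch.cumsum(scaled, dim=0)
    return cdf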

Expected behavior

I don't know what the returned value should be, but it seems that my problem is a corner case that is not handled correctly.

Environment

PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 440.33.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-msssim==0.2.0
[pip3] torch==1.7.0
[pip3] torch-cluster==1.5.8
[pip3] torch-geometric==1.6.3
[pip3] torch-scatter==2.0.5
[pip3] torch-sparse==0.6.8
[pip3] torch-spline-conv==1.2.0
[pip3] torchvision==0.8.1
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-msssim 0.2.0 pypi_0 pypi
[conda] torch 1.7.0 pypi_0 pypi
[conda] torch-cluster 1.5.8 pypi_0 pypi
[conda] torch-geometric 1.6.3 pypi_0 pypi
[conda] torch-scatter 2.0.5 pypi_0 pypi
[conda] torch-sparse 0.6.8 pypi_0 pypi
[conda] torch-spline-conv 1.2.0 pypi_0 pypi

- PyTorch / CompressAI Version (e.g., 1.0 / 0.4.0): torch 1.7.0, compressai 1.1.5
- OS (e.g., Linux): Ubuntu 18.04.4 LTS (Bionic Beaver)
- How you installed PyTorch / CompressAI (`pip`, source): pip 
- Build command you used (if compiling from source):
- Python version: 3.7
- CUDA/cuDNN version: 10.2
- GPU models and configuration:
- Any other relevant information: The problem appears on both CPU and GPU

jbegaint commented 3 years ago

Thanks a lot for the detailed report! I'll look into it.

jbegaint commented 3 years ago

Hi @cvignac, could you provide a gist or script to reproduce the issue? I could not reproduce it on bare metal, nor in an ubuntu:18.04 Docker image.

cvignac commented 3 years ago

Sorry, my description was in fact incorrect. The tensor that raised the exception was one that contained only zeros. I cannot really tell you why it appears; it was still there after 3 epochs of training, but disappeared now that I have trained for longer (20 epochs).

It seems normal that an exception is raised when computing the CDF of an all-zero tensor, but I don't know how to prevent these tensors from occurring.
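
The only mitigation I can think of on my side (just a sketch, untested) is to floor the pmf at a small epsilon and renormalize before it gets quantized, so that no row is completely zero:

import torch

def stabilize_pmf(pmf: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # Clamp every bin to at least `eps` and renormalize each row to sum to 1,
    # so the quantization step never sees an all-zero pmf.
    pmf = pmf.clamp_min(eps)
    return pmf / pmf.sum(dim=-1, keepdim=True)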

jbegaint commented 3 years ago

Ok, thanks, I'll look into it.

jbegaint commented 3 years ago

Ok, so indeed we don't do any checks when computing the CDF; I'll fix that. Regarding the zero tensors, I'm not sure why they appear during training. Which models are you training?
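
For reference, the kind of check I have in mind (the final fix may differ) would validate each pmf row on the Python side before it reaches the C++ coder, so that a degenerate pmf fails with a readable error instead of a floating point exception:

import torch

def validate_pmf_row(prob: torch.Tensor, index: int) -> None:
    # Reject rows that the quantized-CDF routine cannot handle.
    if not torch.isfinite(prob).all():
        raise ValueError(f"pmf row {index} contains NaN or inf values")
    if prob.sum() <= 0:
        raise ValueError(f"pmf row {index} has no probability mass")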

cvignac commented 3 years ago

I'm training a custom model for graph compression with your EntropyBottleneck class.

It may be worth mentioning that my input to the entropy bottleneck has shape (1, C, 1, N), where C is the number of channels of the entropy bottleneck and N is the number of nodes. It probably does not matter because of the flatten operation inside the entropy bottleneck, but I mention it just in case.
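
In case it helps, here is a minimal sketch of that setup (shapes and values are illustrative, not my actual model):

import torch
from compressai.entropy_models import EntropyBottleneck

C, N = 64, 128  # number of channels and number of graph nodes (illustrative)
entropy_bottleneck = EntropyBottleneck(C)

y = torch.randn(1, C, 1, N)                   # latent of shape (1, C, 1, N)
y_hat, y_likelihoods = entropy_bottleneck(y)  # training-time forward pass

entropy_bottleneck.update()                   # the call where the SIGFPE occurred in my run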

jbegaint commented 3 years ago

Could you share a small example so I can run some tests on my end?

jbegaint commented 3 years ago

Closing stale issue. If you think it should remain open, feel free to reopen it.