Closed cvignac closed 3 years ago
Thanks a lot for the detailed report! I'll look into it.
Hi @cvignac, could you provide me with a gist or script to reproduce the issue? I could not reproduce it on bare metal, nor in a ubuntu:18.04 Docker image.
Sorry, my description was in fact incorrect. The tensor that raised the exception was one that contains only zeros. I cannot really tell you why it appears, but it did appear after 3 epochs of training, and disappeared now that I have trained for longer (20 epochs).
It seems normal that an exception is raised when computing the CDF of a zero tensor, but I don't know how to prevent these tensors from occurring.
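To illustrate why an all-zero pmf is problematic here, below is a minimal pure-Python sketch of quantizing a pmf into a fixed-point CDF. This is an assumption about the general shape of the algorithm, not CompressAI's actual C++ implementation; the point is that the normalization step divides by the pmf's total mass, which is zero here.

```python
# Sketch (not CompressAI's code) of pmf -> quantized CDF at a given
# fixed-point precision. An all-zero pmf breaks the normalization:
# in Python this raises ZeroDivisionError, while compiled code can
# surface the same condition as a floating point exception with no
# Python stack trace.
def pmf_to_quantized_cdf_sketch(pmf, precision=16):
    total = sum(pmf)
    if total == 0:
        raise ZeroDivisionError("pmf has zero total mass")
    scale = 1 << precision
    cdf = [0]
    acc = 0.0
    for p in pmf:
        acc += p / total
        cdf.append(round(acc * scale))
    return cdf

print(pmf_to_quantized_cdf_sketch([0.25, 0.5, 0.25]))
# -> [0, 16384, 49152, 65536]
```

With `[0.0, 0.0, 0.0]` as input, the sketch fails at `sum(pmf) == 0`, mirroring the crash described above.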
OK, thanks, I'll look into it.
OK, so indeed we don't do any checks when computing the CDF. I'll fix that. Regarding the zero tensors, I'm not sure why they appear during training. Which models are you training?
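One way such a check could look is to validate each pmf before handing it to the CDF quantization. This is a hypothetical sketch of a guard, not the actual fix that landed in CompressAI; the threshold name `eps` and the error messages are assumptions for illustration.

```python
# Hypothetical pre-check: reject a degenerate pmf with a readable
# Python error instead of letting native code crash on it.
def check_pmf(pmf, eps=1e-12):
    if any(p < 0 for p in pmf):
        raise ValueError("pmf has negative entries")
    if sum(pmf) < eps:
        raise ValueError(
            "pmf has (near-)zero total mass; "
            "the model may not have been trained/updated correctly"
        )
    return pmf

check_pmf([0.1, 0.9])    # passes through unchanged
# check_pmf([0.0, 0.0])  # raises ValueError with a clear message
```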
I'm training a custom model for graph compression with your entropy_bottleneck class.
It may be worth mentioning that my input to the entropy bottleneck has shape (1, C, 1, N), where C is the number of channels of the entropy bottleneck and N is the number of nodes. It probably does not matter because of the flatten operation in the entropy bottleneck, but I mention it just in case.
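The flatten step described above can be sketched as follows. This is an illustration with plain Python lists, not CompressAI's exact code: each channel c of a (B, C, H, W) input is reduced to a flat 1-D sequence of its values, so a 1 x N spatial layout should behave the same as H x W.

```python
# Sketch: flatten a nested-list "tensor" of shape (B, C, H, W)
# into one flat list of values per channel.
def flatten_per_channel(x):
    B, C = len(x), len(x[0])
    return [
        [v for b in range(B) for row in x[b][c] for v in row]
        for c in range(C)
    ]

x = [[[[1, 2, 3]],    # channel 0, spatial shape (1, 3)
      [[4, 5, 6]]]]   # channel 1
print(flatten_per_channel(x))  # -> [[1, 2, 3], [4, 5, 6]]
```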
Could you share a small example so I can run some tests on my end?
Closing stale issue. If you think it should remain open, feel free to reopen it.
Bug
Hello, I get a floating point exception when trying to update a CompressionModel. There is no stack trace in the error message, so I guess it comes from an internal C module.
Using prints, I traced the problem to:
model.update() -> self._pmf_to_cdf(pmf, tail_mass, pmf_length, max_length) -> _cdf = pmf_to_quantized_cdf(prob, self.entropy_coder_precision)
The exception was raised when calling the function on a tensor p where all entries but one are (numerically) zero.
To Reproduce
Steps to reproduce the behavior:
Call pmf_to_quantized_cdf on the following tensor:
tensor([...], device='cuda:0', grad_fn=<...>), where every entry is 0.0000e+00 except five consecutive values: 1.6690e-33, 6.5444e-11, 1.0000e+00, 1.2011e-13, 1.0696e-36 (the grad_fn name was stripped from the original report).
Expected behavior
I don't know what the returned value should be, but my input seems to be a corner case that is not handled correctly.
Environment
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 440.33.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-msssim==0.2.0
[pip3] torch==1.7.0
[pip3] torch-cluster==1.5.8
[pip3] torch-geometric==1.6.3
[pip3] torch-scatter==2.0.5
[pip3] torch-sparse==0.6.8
[pip3] torch-spline-conv==1.2.0
[pip3] torchvision==0.8.1
[conda] numpy 1.19.4 pypi_0 pypi
[conda] pytorch-msssim 0.2.0 pypi_0 pypi
[conda] torch 1.7.0 pypi_0 pypi
[conda] torch-cluster 1.5.8 pypi_0 pypi
[conda] torch-geometric 1.6.3 pypi_0 pypi
[conda] torch-scatter 2.0.5 pypi_0 pypi
[conda] torch-sparse 0.6.8 pypi_0 pypi
[conda] torch-spline-conv 1.2.0 pypi_0 pypi