andreas128 / SRFlow

Official SRFlow training code: Super-Resolution using Normalizing Flow in PyTorch

Bus error (core dumped) #19

Open · 18813185122 opened this issue 3 years ago

18813185122 commented 3 years ago

First of all, thank you very much for your work. When I train the ×4 super-resolution model with your code, training runs for a while and then crashes with "Bus error (core dumped)". I ran it as `python -X faulthandler train.py -opt ./confs/SRFlow_DF2K_4X.yml`.

It outputs:

```
21-02-15 21:58:11.131 - INFO: Model [SRFlowModel] is created.
21-02-15 21:58:11.131 - INFO: Resuming training from epoch: 2, iter: 51000.
21-02-15 21:58:11.450 - INFO: Start training from epoch: 2, iter: 51000
/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
  warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
<epoch: 2, iter: 51,001, lr:2.500e-04, t:-1.00e+00, td:9.45e-01, eta:-4.14e+01, nll:-1.566e+01>
<epoch: 2, iter: 51,002, lr:2.500e-04, t:-1.00e+00, td:8.32e-04, eta:-4.14e+01, nll:-1.597e+01>
<epoch: 2, iter: 51,003, lr:2.500e-04, t:1.94e+00, td:2.45e-03, eta:8.01e+01, nll:-1.660e+01>
<epoch: 2, iter: 51,004, lr:2.500e-04, t:1.78e+00, td:3.97e-03, eta:7.37e+01, nll:-1.757e+01>
<epoch: 2, iter: 51,005, lr:2.500e-04, t:1.77e+00, td:8.54e-04, eta:7.32e+01, nll:-1.686e+01>
<epoch: 2, iter: 51,006, lr:2.500e-04, t:2.06e+00, td:6.81e-04, eta:8.52e+01, nll:-1.774e+01>
<epoch: 2, iter: 51,007, lr:2.500e-04, t:1.71e+00, td:1.89e-03, eta:7.06e+01, nll:-1.683e+01>
<epoch: 2, iter: 51,008, lr:2.500e-04, t:1.93e+00, td:2.01e-03, eta:7.98e+01, nll:-1.652e+01>
<epoch: 2, iter: 51,009, lr:2.500e-04, t:1.97e+00, td:2.18e-03, eta:8.16e+01, nll:-1.687e+01>
<epoch: 2, iter: 51,010, lr:2.500e-04, t:1.87e+00, td:2.10e-03, eta:7.72e+01, nll:-1.748e+01>
<epoch: 2, iter: 51,011, lr:2.500e-04, t:1.78e+00, td:3.10e-03, eta:7.36e+01, nll:-1.672e+01>
<epoch: 2, iter: 51,012, lr:2.500e-04, t:2.06e+00, td:3.12e-03, eta:8.51e+01, nll:-1.859e+01>
<epoch: 2, iter: 51,013, lr:2.500e-04, t:1.83e+00, td:2.23e-03, eta:7.57e+01, nll:-1.672e+01>
<epoch: 2, iter: 51,014, lr:2.500e-04, t:1.81e+00, td:2.39e-03, eta:7.50e+01, nll:-1.772e+01>
<epoch: 2, iter: 51,015, lr:2.500e-04, t:1.84e+00, td:1.94e-03, eta:7.60e+01, nll:-1.877e+01>
<epoch: 2, iter: 51,016, lr:2.500e-04, t:1.73e+00, td:3.45e-03, eta:7.17e+01, nll:-1.696e+01>
<epoch: 2, iter: 51,017, lr:2.500e-04, t:1.84e+00, td:2.32e-03, eta:7.62e+01, nll:-1.874e+01>
<epoch: 2, iter: 51,018, lr:2.500e-04, t:2.22e+00, td:2.27e-03, eta:9.18e+01, nll:-1.709e+01>
<epoch: 2, iter: 51,019, lr:2.500e-04, t:1.90e+00, td:1.72e-03, eta:7.87e+01, nll:-1.638e+01>
<epoch: 2, iter: 51,020, lr:2.500e-04, t:1.77e+00, td:2.30e-03, eta:7.31e+01, nll:-1.529e+01>
<epoch: 2, iter: 51,021, lr:2.500e-04, t:1.86e+00, td:3.02e-03, eta:7.70e+01, nll:-1.642e+01>
<epoch: 2, iter: 51,022, lr:2.500e-04, t:1.81e+00, td:2.15e-03, eta:7.48e+01, nll:-1.789e+01>
<epoch: 2, iter: 51,023, lr:2.500e-04, t:1.85e+00, td:2.35e-03, eta:7.65e+01, nll:-1.866e+01>
<epoch: 2, iter: 51,024, lr:2.500e-04, t:1.83e+00, td:2.18e-03, eta:7.57e+01, nll:-1.676e+01>
<epoch: 2, iter: 51,100, lr:2.500e-04, t:1.88e+00, td:2.37e-03, eta:7.78e+01, nll:-1.536e+01>
<epoch: 2, iter: 51,200, lr:2.500e-04, t:1.90e+00, td:2.51e-03, eta:7.86e+01, nll:-1.572e+01>
<epoch: 2, iter: 51,300, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.75e+01, nll:-1.708e+01>
<epoch: 2, iter: 51,400, lr:2.500e-04, t:1.86e+00, td:2.42e-03, eta:7.68e+01, nll:-1.943e+01>
<epoch: 2, iter: 51,500, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.76e+01, nll:-1.640e+01>
<epoch: 2, iter: 51,600, lr:2.500e-04, t:1.87e+00, td:2.39e-03, eta:7.71e+01, nll:-1.571e+01>
<epoch: 2, iter: 51,700, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.74e+01, nll:-1.633e+01>
<epoch: 2, iter: 51,800, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.73e+01, nll:-1.499e+01>
<epoch: 2, iter: 51,900, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.71e+01, nll:-1.538e+01>
<epoch: 2, iter: 52,000, lr:2.500e-04, t:1.87e+00, td:2.40e-03, eta:7.70e+01, nll:-1.629e+01>
21-02-15 22:29:40.137 - INFO: Saving models and training states.
<epoch: 2, iter: 52,100, lr:2.500e-04, t:1.90e+00, td:2.42e-03, eta:7.79e+01, nll:-1.673e+01>
<epoch: 2, iter: 52,200, lr:2.500e-04, t:1.89e+00, td:2.46e-03, eta:7.77e+01, nll:-1.898e+01>
<epoch: 2, iter: 52,300, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.77e+01, nll:-1.815e+01>
<epoch: 2, iter: 52,400, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.64e+01, nll:-1.801e+01>
<epoch: 2, iter: 52,500, lr:2.500e-04, t:1.89e+00, td:2.48e-03, eta:7.76e+01, nll:-1.746e+01>
<epoch: 2, iter: 52,600, lr:2.500e-04, t:1.88e+00, td:2.53e-03, eta:7.70e+01, nll:-1.614e+01>
<epoch: 2, iter: 52,700, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.66e+01, nll:-1.496e+01>
<epoch: 2, iter: 52,800, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.71e+01, nll:-1.682e+01>
<epoch: 2, iter: 52,900, lr:2.500e-04, t:1.87e+00, td:2.48e-03, eta:7.66e+01, nll:-1.676e+01>
<epoch: 2, iter: 53,000, lr:2.500e-04, t:1.87e+00, td:2.42e-03, eta:7.62e+01, nll:-1.719e+01>
21-02-15 23:01:01.845 - INFO: Saving models and training states.
<epoch: 2, iter: 53,100, lr:2.500e-04, t:1.87e+00, td:2.37e-03, eta:7.62e+01, nll:-1.640e+01>
<epoch: 2, iter: 53,200, lr:2.500e-04, t:1.87e+00, td:2.41e-03, eta:7.64e+01, nll:-1.765e+01>
<epoch: 2, iter: 53,300, lr:2.500e-04, t:1.89e+00, td:2.45e-03, eta:7.69e+01, nll:-1.725e+01>
<epoch: 2, iter: 53,400, lr:2.500e-04, t:1.89e+00, td:2.45e-03, eta:7.70e+01, nll:-1.702e+01>
<epoch: 2, iter: 53,500, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.64e+01, nll:-1.803e+01>
<epoch: 2, iter: 53,600, lr:2.500e-04, t:1.88e+00, td:2.42e-03, eta:7.65e+01, nll:-1.760e+01>
<epoch: 2, iter: 53,700, lr:2.500e-04, t:1.86e+00, td:2.40e-03, eta:7.57e+01, nll:-1.747e+01>
<epoch: 2, iter: 53,800, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.66e+01, nll:-2.144e+01>
<epoch: 2, iter: 53,900, lr:2.500e-04, t:1.90e+00, td:2.43e-03, eta:7.72e+01, nll:-1.826e+01>
<epoch: 2, iter: 54,000, lr:2.500e-04, t:1.88e+00, td:2.40e-03, eta:7.64e+01, nll:-1.700e+01>
21-02-15 23:32:23.089 - INFO: Saving models and training states.
<epoch: 2, iter: 54,100, lr:2.500e-04, t:1.90e+00, td:2.55e-03, eta:7.70e+01, nll:-1.809e+01>
<epoch: 2, iter: 54,200, lr:2.500e-04, t:1.89e+00, td:2.46e-03, eta:7.64e+01, nll:-1.832e+01>
<epoch: 2, iter: 54,300, lr:2.500e-04, t:1.90e+00, td:2.47e-03, eta:7.67e+01, nll:-1.641e+01>
<epoch: 2, iter: 54,400, lr:2.500e-04, t:1.89e+00, td:2.47e-03, eta:7.63e+01, nll:-1.669e+01>
<epoch: 2, iter: 54,500, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.60e+01, nll:-1.491e+01>
<epoch: 2, iter: 54,600, lr:2.500e-04, t:1.88e+00, td:2.46e-03, eta:7.58e+01, nll:-1.798e+01>
<epoch: 2, iter: 54,700, lr:2.500e-04, t:1.90e+00, td:2.41e-03, eta:7.65e+01, nll:-1.596e+01>
<epoch: 2, iter: 54,800, lr:2.500e-04, t:1.88e+00, td:2.53e-03, eta:7.59e+01, nll:-1.580e+01>
<epoch: 2, iter: 54,900, lr:2.500e-04, t:1.88e+00, td:2.40e-03, eta:7.58e+01, nll:-1.713e+01>
<epoch: 2, iter: 55,000, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.62e+01, nll:-1.874e+01>
21-02-16 00:03:50.708 - INFO: Saving models and training states.
<epoch: 2, iter: 55,100, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.62e+01, nll:-1.506e+01>
<epoch: 2, iter: 55,200, lr:2.500e-04, t:1.88e+00, td:2.39e-03, eta:7.55e+01, nll:-1.786e+01>
<epoch: 2, iter: 55,300, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.58e+01, nll:-1.834e+01>
<epoch: 2, iter: 55,400, lr:2.500e-04, t:1.88e+00, td:2.39e-03, eta:7.54e+01, nll:-1.841e+01>
<epoch: 2, iter: 55,500, lr:2.500e-04, t:1.89e+00, td:2.41e-03, eta:7.58e+01, nll:-1.820e+01>
<epoch: 2, iter: 55,600, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.51e+01, nll:-1.633e+01>
<epoch: 2, iter: 55,700, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.57e+01, nll:-1.660e+01>
<epoch: 2, iter: 55,800, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.59e+01, nll:-1.856e+01>
<epoch: 2, iter: 55,900, lr:2.500e-04, t:1.90e+00, td:2.45e-03, eta:7.59e+01, nll:-1.613e+01>
CUBLAS error: out of memory (3) in magma_sgetrf_gpu_expert at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/sgetrf_gpu.cpp:126
CUBLAS error: not initialized (1) in magma_sgetrf_gpu_expert at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/sgetrf_gpu.cpp:126
Skipping ERROR caught in nll = model.optimize_parameters(current_step): Caught RuntimeError in replica 1 on device 1.
```
```
Original Traceback (most recent call last):
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/SRFlowNet_arch.py", line 65, in forward
    y_onehot=y_label)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/SRFlowNet_arch.py", line 101, in normal_flow
    y_onehot=y_onehot)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowUpsamplerNet.py", line 213, in forward
    z, logdet = self.encode(gt, rrdbResults, logdet=logdet, epses=epses, y_onehot=y_onehot)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowUpsamplerNet.py", line 238, in encode
    fl_fea, logdet = layer(fl_fea, logdet, reverse=reverse, rrdbResults=level_conditionals[level])
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 84, in forward
    return self.normal_flow(input, logdet, rrdbResults)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 103, in normal_flow
    self, z, logdet, False)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 35, in <lambda>
    "invconv": lambda obj, z, logdet, rev: obj.invconv(z, logdet, rev),
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 48, in forward
    weight, dlogdet = self.get_weight(input, reverse)
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 37, in get_weight
    dlogdet = torch.slogdet(self.weight)[1] * pixels
RuntimeError: CUDA error: resource already mapped
```

```
<epoch: 2, iter: 56,000, lr:2.500e-04, t:1.86e+00, td:2.41e-03, eta:7.44e+01, nll:-1.589e+01>
21-02-16 00:35:13.687 - INFO: Saving models and training states.
<epoch: 2, iter: 56,100, lr:2.500e-04, t:1.88e+00, td:2.41e-03, eta:7.51e+01, nll:-1.545e+01>
<epoch: 2, iter: 56,200, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.48e+01, nll:-1.524e+01>
<epoch: 2, iter: 56,300, lr:2.500e-04, t:1.88e+00, td:2.50e-03, eta:7.49e+01, nll:-1.727e+01>
<epoch: 2, iter: 56,400, lr:2.500e-04, t:1.85e+00, td:2.40e-03, eta:7.40e+01, nll:-1.717e+01>
<epoch: 2, iter: 56,500, lr:2.500e-04, t:1.88e+00, td:2.48e-03, eta:7.48e+01, nll:-1.548e+01>
<epoch: 2, iter: 56,600, lr:2.500e-04, t:1.86e+00, td:2.48e-03, eta:7.42e+01, nll:-1.752e+01>
<epoch: 2, iter: 56,700, lr:2.500e-04, t:1.88e+00, td:2.48e-03, eta:7.47e+01, nll:-1.669e+01>
<epoch: 2, iter: 56,800, lr:2.500e-04, t:1.86e+00, td:2.43e-03, eta:7.40e+01, nll:-1.632e+01>
<epoch: 2, iter: 56,900, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.40e+01, nll:-1.778e+01>
<epoch: 2, iter: 57,000, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.47e+01, nll:-1.696e+01>
21-02-16 01:06:23.673 - INFO: Saving models and training states.
<epoch: 2, iter: 57,100, lr:2.500e-04, t:1.90e+00, td:2.48e-03, eta:7.54e+01, nll:-1.575e+01>
<epoch: 2, iter: 57,200, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.39e+01, nll:-1.667e+01>
<epoch: 2, iter: 57,300, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.47e+01, nll:-1.871e+01>
<epoch: 2, iter: 57,400, lr:2.500e-04, t:1.88e+00, td:2.50e-03, eta:7.44e+01, nll:-1.781e+01>
<epoch: 2, iter: 57,500, lr:2.500e-04, t:1.86e+00, td:2.43e-03, eta:7.36e+01, nll:-1.881e+01>
<epoch: 2, iter: 57,600, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.38e+01, nll:-1.742e+01>
<epoch: 2, iter: 57,700, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.38e+01, nll:-1.726e+01>
<epoch: 2, iter: 57,800, lr:2.500e-04, t:1.86e+00, td:2.42e-03, eta:7.34e+01, nll:-1.844e+01>
<epoch: 2, iter: 57,900, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.36e+01, nll:-1.622e+01>
<epoch: 2, iter: 58,000, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.38e+01, nll:-1.635e+01>
21-02-16 01:37:34.238 - INFO: Saving models and training states.
<epoch: 2, iter: 58,100, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.44e+01, nll:-1.692e+01>
<epoch: 2, iter: 58,200, lr:2.500e-04, t:1.84e+00, td:2.40e-03, eta:7.25e+01, nll:-1.594e+01>
<epoch: 2, iter: 58,300, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.36e+01, nll:-1.747e+01>
<epoch: 2, iter: 58,400, lr:2.500e-04, t:1.87e+00, td:2.49e-03, eta:7.34e+01, nll:-1.949e+01>
<epoch: 2, iter: 58,500, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.32e+01, nll:-1.595e+01>
<epoch: 2, iter: 58,600, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.32e+01, nll:-1.600e+01>
<epoch: 2, iter: 58,700, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.39e+01, nll:-1.668e+01>
<epoch: 2, iter: 58,800, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.33e+01, nll:-1.868e+01>
<epoch: 2, iter: 58,900, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.28e+01, nll:-1.802e+01>
<epoch: 2, iter: 59,000, lr:2.500e-04, t:1.86e+00, td:2.47e-03, eta:7.27e+01, nll:-1.569e+01>
21-02-16 02:08:39.673 - INFO: Saving models and training states.
<epoch: 2, iter: 59,100, lr:2.500e-04, t:1.87e+00, td:2.42e-03, eta:7.34e+01, nll:-1.721e+01>
<epoch: 2, iter: 59,200, lr:2.500e-04, t:1.84e+00, td:2.39e-03, eta:7.21e+01, nll:-1.866e+01>
<epoch: 2, iter: 59,300, lr:2.500e-04, t:1.85e+00, td:2.47e-03, eta:7.22e+01, nll:-1.685e+01>
<epoch: 2, iter: 59,400, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.35e+01, nll:-1.809e+01>
<epoch: 2, iter: 59,500, lr:2.500e-04, t:1.88e+00, td:2.42e-03, eta:7.35e+01, nll:-1.618e+01>
Fatal Python error: Bus error
```

```
Thread 0x00007f2e50c69700 (most recent call first):
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 37 in get_weight
  File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 48 in forward
Fatal Python error: Segmentation fault
  File "/run/me
Segmentation fault (core dumped)
(myenv) (python37) [root@master code]#
```
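For context on where both crashes land: `Permutations.py:37` is the log-determinant term of SRFlow's invertible 1×1 convolution, `dlogdet = torch.slogdet(self.weight)[1] * pixels`. A minimal NumPy sketch of that same computation (using `numpy.linalg.slogdet` in place of `torch.slogdet` purely for illustration; the names `weight` and `pixels` follow the traceback):

```python
import numpy as np

def invconv_logdet(weight: np.ndarray, pixels: int) -> float:
    """Log-determinant contribution of an invertible 1x1 convolution.

    Every one of the `pixels` spatial positions is multiplied by the same
    C x C `weight` matrix, so the total contribution is
    pixels * log|det(weight)|. slogdet returns (sign, log|det|); taking
    index [1] keeps the log-magnitude, mirroring the line in the traceback.
    """
    sign, logabsdet = np.linalg.slogdet(weight)
    return float(pixels * logabsdet)

# A 2x2 diagonal weight with det = 2 * 3 = 6, applied over 4 pixels:
w = np.diag([2.0, 3.0])
print(invconv_logdet(w, 4))  # 4 * ln(6)
```

Since `slogdet` is backed by an LU factorization, on the GPU this is the call that ends up in MAGMA's `magma_sgetrf_gpu_expert`, which is why the CUBLAS out-of-memory errors above point at `sgetrf_gpu.cpp`.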

What should I do?
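For anyone reproducing this: the `-X faulthandler` interpreter flag used above, which produced the per-thread dump after "Fatal Python error", can also be enabled from inside the script. A minimal stdlib sketch (the log-file handling is illustrative, not part of the SRFlow code):

```python
import faulthandler
import sys

# Install CPython's built-in crash handler: on SIGSEGV, SIGBUS, SIGFPE,
# or SIGABRT it dumps the Python traceback of every thread before the
# process dies -- the same output `python -X faulthandler train.py ...`
# produced above, but without needing the command-line flag.
faulthandler.enable(file=sys.stderr, all_threads=True)

print(faulthandler.is_enabled())  # True
```

Writing to an open log file instead of `sys.stderr` (via the `file=` argument) can help when stderr is interleaved with training output, as in the garbled dump above.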