EOFError at connection.py

yanghaoxiang7 commented 2 years ago

@alexfrom0815

During training I ran into a problem:

...
    (critic): Sequential(
      (0): Conv2d(64, 4, kernel_size=(1, 1), stride=(1, 1))
      (1): ReLU()
      (2): Flatten()
      (3): Linear(in_features=400, out_features=256, bias=True)
      (4): ReLU()
    )
    (critic_linear): Linear(in_features=256, out_features=1, bias=True)
  )
  (dist): Categorical(
    (linear): Linear(in_features=256, out_features=100, bias=True)
  )
)
Rotation: False
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/HDD_4T/Reps/baselines/baselines/common/vec_env/shmem_vec_env.py", line 123, in _subproc_worker
    cmd, data = pipe.recv()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
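
The EOFError in the worker is usually a symptom rather than the root cause: `recv()` on a `multiprocessing` connection raises it when nothing is left to read and the other end has been closed, which is what a worker sees if the main process dies. A minimal single-process sketch of that behavior (the function name is made up for illustration, and closing the parent end here only simulates a crashed parent):

```python
from multiprocessing import Pipe


def recv_after_peer_closed():
    """Return the name of the outcome when recv() is called on an orphaned pipe end."""
    parent_end, worker_end = Pipe()
    # Simulate the main process going away (e.g. after a segmentation fault).
    parent_end.close()
    try:
        worker_end.recv()
    except EOFError:
        # Same exception as in the worker traceback above.
        return "EOFError"
    finally:
        worker_end.close()
    return "no error"
```

If this is what is happening, the traceback of interest is in the main process, not in the worker.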

By adding print statements, I traced the failure to a segmentation fault around here:

(in kfac.py)
      if self.steps % self.Tf == 0:
          # My asynchronous implementation exists, I will add it later.
          # Experimenting with different ways to this in PyTorch.

          self.d_g[m], self.Q_g[m] = torch.symeig(
              self.m_gg[m], eigenvectors=True)
          self.d_a[m], self.Q_a[m] = torch.symeig(
              self.m_aa[m], eigenvectors=True)

          self.d_a[m].mul_((self.d_a[m] > 1e-6).float())
          self.d_g[m].mul_((self.d_g[m] > 1e-6).float())

I suspect the problem is in torch.symeig, since I found several issues reporting it. Unlike those reports, though, my run stops in the first episode rather than after several hours of training. Is there a solution to this problem? Many thanks!
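
As an alternative to print-based debugging, Python's standard `faulthandler` module can dump the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. A minimal sketch, assuming it is called near the top of the training script in the process that crashes (the helper name is hypothetical):

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask Python to print a traceback on fatal signals such as SIGSEGV.

    `stream` must be a real file with a file descriptor; it defaults to stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Usage (at the start of the training script):
# enable_crash_tracebacks()
```

The same effect is available without code changes by running `python -X faulthandler main.py` or by setting the `PYTHONFAULTHANDLER=1` environment variable.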

yanghaoxiang7 commented 2 years ago

BTW, I can run the training code with A2C and the testing code.

yanghaoxiang7 commented 2 years ago

I see there may be a way to add a "mask value", but I couldn't find it in config.py.

yanghaoxiang7 commented 2 years ago

Bug fixed. The problem is in acktr/algo/kfac.py. I don't know why, but in my setup torch.symeig only works on the CPU; running it on the GPU leads to a segmentation fault. Solution:

                self.d_g[m], self.Q_g[m] = torch.symeig(
                    self.m_gg[m].cpu(), eigenvectors=True)
                self.d_g[m], self.Q_g[m] = self.d_g[m].cuda(), self.Q_g[m].cuda()
                self.d_a[m], self.Q_a[m] = torch.symeig(
                    self.m_aa[m].cpu(), eigenvectors=True)
                self.d_a[m], self.Q_a[m] = self.d_a[m].cuda(), self.Q_a[m].cuda()

I'm using torch 1.7.1 + CUDA 11. Not sure why this happens.
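
The same workaround can be written as a small device-agnostic helper, so the results return to whatever device the input lived on instead of hard-coding `.cuda()`. This is only a sketch: the helper name is hypothetical, and the fallback assumes that newer PyTorch releases provide `torch.linalg.eigh` as the replacement for the deprecated `torch.symeig`.

```python
import torch


def symmetric_eig_on_cpu(matrix):
    """Eigendecompose a symmetric matrix on the CPU, returning results on the input's device."""
    device = matrix.device
    cpu_matrix = matrix.cpu()
    if hasattr(torch, "linalg") and hasattr(torch.linalg, "eigh"):
        # Newer PyTorch: torch.linalg.eigh returns (eigenvalues, eigenvectors).
        eigenvalues, eigenvectors = torch.linalg.eigh(cpu_matrix)
    else:
        # Older PyTorch (such as 1.7.1): fall back to torch.symeig.
        eigenvalues, eigenvectors = torch.symeig(cpu_matrix, eigenvectors=True)
    return eigenvalues.to(device), eigenvectors.to(device)


# Usage inside the K-FAC update (names taken from the snippet above):
# self.d_g[m], self.Q_g[m] = symmetric_eig_on_cpu(self.m_gg[m])
# self.d_a[m], self.Q_a[m] = symmetric_eig_on_cpu(self.m_aa[m])
```

The round trip through the CPU costs a device transfer on every call, but the decomposition only runs every `Tf` steps, so the overhead is likely small.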

suoyike1 commented 2 months ago

BTW, I can run the training code with A2C and the testing code.

How do I train this model with A2C? When I run the training code with A2C, I get the following error:

    Traceback (most recent call last):
      File "main.py", line 233, in <module>
        main(args)
      File "main.py", line 24, in main
        train_model(args)
      File "main.py", line 99, in train_model
        args.lr,
    AttributeError: 'Namespace' object has no attribute 'lr'

yanghaoxiang7 commented 2 months ago

Your error indicates that "args" has no "lr" attribute. "lr" is the learning rate and is typically passed as a command-line argument (parsed into "args"). Check that you are running the code according to the authors' instructions; you can also add print("args:", args) to debug. Hope this helps.
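
To illustrate the check described above, here is a minimal `argparse` sketch. The flag names and the default learning rate are assumptions for the example, not the repository's actual arguments; the point is that `args.lr` only exists if the parser defines an `--lr` argument.

```python
import argparse


def parse_args(argv):
    """Build a toy parser; flag names and defaults here are hypothetical."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--algorithm", default="a2c")
    # Without this line, accessing args.lr raises AttributeError.
    parser.add_argument("--lr", type=float, default=7e-4)
    return parser.parse_args(argv)


def missing_attributes(args, required):
    """List the required attribute names that the parsed namespace lacks."""
    return [name for name in required if not hasattr(args, name)]


# Usage:
# args = parse_args(["--lr", "0.001"])
# print("args:", args)
# print("missing:", missing_attributes(args, ["lr", "algorithm"]))
```

Printing the namespace before training starts shows which attributes were actually parsed, which makes a missing `lr` easy to spot.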

alexfrom0815 / Online-3D-BPP-DRL

EOFError at connection.py #12