SSCHAcode / python-sscha

The python implementation of the Stochastic Self-Consistent Harmonic Approximation (SSCHA).
GNU General Public License v3.0

If asking for the queue fails, the full job is resubmitted #78

Closed mesonepigreco closed 2 years ago

mesonepigreco commented 2 years ago

If communication with the cluster fails while checking the queue, the full job is resubmitted. This should be avoided: the queue query should at least be retried a few times before the job is treated as lost.
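A minimal sketch of the retry idea, not the code in `Cluster.py`. The names `query_queue_with_retries`, `QueueQueryError`, `run_cmd`, and `ssh_squeue` are hypothetical, and the sketch assumes a queue query can be wrapped in a callable that returns the exit status together with the captured output:

```python
import subprocess
import time


class QueueQueryError(RuntimeError):
    """Raised when the queue could not be read after all attempts."""


def query_queue_with_retries(run_cmd, max_attempts=5, wait_seconds=0.0):
    """Call run_cmd() until it succeeds or the attempts run out.

    run_cmd is a hypothetical callable returning (exit_status, stdout_text).
    """
    for attempt in range(1, max_attempts + 1):
        status, output = run_cmd()
        if status == 0:
            return output
        # Communication failure (e.g. ssh exit status 255): wait, then ask again
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    # Report the failure explicitly instead of treating the job as gone
    raise QueueQueryError(f"queue query failed {max_attempts} times")


def ssh_squeue():
    """Example run_cmd: the command from the report below, run via subprocess."""
    proc = subprocess.run(
        ["ssh", "daint", "squeue -u $USER"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

With this shape, the caller resubmits only when the queue was actually read and the job is missing from it; a `QueueQueryError` means the state of the job is unknown.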

To be fixed before pull request:

#72

A report of the error:

kex_exchange_identification: read: Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Error with cmd: ssh  daint 'squeue -u $USER'

EXITSTATUS: 255; attempt = 1
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/home/lmonacelli/anaconda3/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/lmonacelli/anaconda3/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lmonacelli/anaconda3/lib/python3.9/site-packages/sscha/Cluster.py", line 1452, in compute_single_jobarray
    subs, indices, labels = self.batch_submission(structures, calc, jobs_id, ".pwi",
  File "/home/lmonacelli/anaconda3/lib/python3.9/site-packages/sscha/Cluster.py", line 922, in batch_submission
    while not self.check_job_finished(job_id):
  File "/home/lmonacelli/anaconda3/lib/python3.9/site-packages/sscha/Cluster.py", line 1000, in check_job_finished
    if data[0] == job_id:
IndexError: list index out of range
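
The `IndexError` comes from indexing `data[0]` when a line of the queue output splits into an empty list. A hedged sketch of a guard follows; the body of `check_job_finished` is not shown here, so the helper name `job_in_queue` and the assumption that the job ID is the first whitespace-separated column are illustrative only:

```python
def job_in_queue(squeue_output, job_id):
    """Return True if job_id appears as the first column of any output line.

    Assumes (not confirmed by the traceback) that each queue line starts
    with the job ID followed by whitespace-separated fields.
    """
    for line in squeue_output.splitlines():
        data = line.split()
        if not data:
            # Blank line or empty output: nothing to compare, skip it
            continue
        if data[0] == job_id:
            return True
    return False
```

This only prevents the crash. An empty output caused by a failed `ssh` call still has to be told apart from a genuinely empty queue, otherwise the job would wrongly appear finished.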

mesonepigreco commented 2 years ago

The fix has been committed; some testing is required before closing.