aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

calculation stuck in RETRIEVING state if remote data is erased #162

Closed aiida-bot closed 7 years ago

aiida-bot commented 8 years ago

Originally reported by: Nicolas Mounet (Bitbucket: mounet, GitHub: nmounet)


If the remote data that is supposed to be retrieved is erased before the calculation reaches the "RETRIEVING" state, the calculation remains in that state forever. This can happen, for instance, if the daemon is stopped for a long time and the remote data on the cluster is erased in the meantime.


aiida-bot commented 8 years ago

Original comment by Nicolas Mounet (Bitbucket: mounet, GitHub: nmounet):


This is part of a more general issue: how to handle lack-of-memory problems.

aiida-bot commented 8 years ago

Original comment by Nicolas Mounet (Bitbucket: mounet, GitHub: nmounet):


Wrong diagnosis: this actually happens when there is no space left on the AiiDA server side; the calculation stays in the RETRIEVING state even after space is freed.

Output of 'verdi calculation logshow':

1001984: RETRIEVING
Scheduler output: N/A
Scheduler errors: N/A
1 LOG MESSAGES:
+-> ERROR at 2016-01-03 19:42:10.277548+00:00
 | Error retrieving calc 1001984. Traceback: Traceback (most recent call last):
 |   File "/home/mounet/Documents/Soft/AiiDA/epfl-aiida/aiida/execmanager.py", line 755, in retrieve_computed_for_authinfo
 |     ignore_nonexisting=True)
 |   File "/home/mounet/Documents/Soft/AiiDA/epfl-aiida/aiida/transport/plugins/ssh.py", line 884, in get
 |     self.getfile( remotepath,localpath,callback,dereference,overwrite )
 |   File "/home/mounet/Documents/Soft/AiiDA/epfl-aiida/aiida/transport/plugins/ssh.py", line 915, in getfile
 |     return self.sftp.get(remotepath,localpath,callback)
 |   File "/usr/local/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 720, in get
 |     size = self.getfo(remotepath, fl, callback)
 |   File "/usr/local/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 694, in getfo
 |     fl.write(data)
 | IOError: [Errno 28] No space left on device
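
For illustration only, here is a minimal sketch of the behaviour one might want instead: when the local write fails with errno 28 (ENOSPC), the calculation is moved to a terminal failure state rather than being left in RETRIEVING. This is not AiiDA's actual code. The function `retrieve_with_state`, the dictionary standing in for a calculation, and the state names `RETRIEVALFAILED` and `RETRIEVED` are all assumptions made for the example.

```python
import errno


def retrieve_with_state(fetch, calc):
    """Call fetch() and always leave calc in a definite state.

    `fetch` is any callable that copies the remote files locally;
    `calc` is a plain dict standing in for a calculation (hypothetical).
    """
    calc["state"] = "RETRIEVING"
    try:
        fetch()
    except OSError as exc:  # IOError is an alias of OSError on Python 3
        if exc.errno == errno.ENOSPC:
            calc["log"] = "no space left on local device"
        else:
            calc["log"] = "I/O error: {}".format(exc)
        # Record the failure instead of staying in RETRIEVING forever.
        calc["state"] = "RETRIEVALFAILED"
        return False
    calc["state"] = "RETRIEVED"
    return True


def _full_disk():
    """Simulate the failure shown in the traceback above."""
    raise OSError(errno.ENOSPC, "No space left on device")


calc = {"pk": 1001984}
retrieve_with_state(_full_disk, calc)
print(calc["state"], "-", calc["log"])
```

With a failure state recorded, the calculation could be retried explicitly once space has been freed, instead of requiring manual intervention on a calculation that looks permanently in progress.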

aiida-bot commented 8 years ago

Original comment by Giovanni Pizzi (Bitbucket: pizzi, GitHub: giovannipizzi):


There should be some message in the log file (~/.aiida/daemon/log/aiida_daemon.log) that can help to properly catch the exception - could you check if you have any relevant message and post it here? :-)
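
As a rough aid for that check, the sketch below searches a log file for a given string and prints each match together with the few lines preceding it. The helper `find_in_log` is hypothetical, and nothing is assumed about the daemon's log format beyond it being a text file; the search string is taken from the traceback posted earlier in this thread.

```python
import os
from collections import deque


def find_in_log(path, needle, context=3):
    """Return one list per match: up to `context` preceding lines, then the match."""
    hits = []
    recent = deque(maxlen=context)  # rolling window of the previous lines
    with open(os.path.expanduser(path), errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if needle in line:
                hits.append(list(recent) + [line])
            recent.append(line)
    return hits


if __name__ == "__main__":
    log_path = "~/.aiida/daemon/log/aiida_daemon.log"
    if os.path.exists(os.path.expanduser(log_path)):
        for block in find_in_log(log_path, "No space left on device"):
            print("\n".join(block))
            print("-" * 40)
```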