geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

Occasional smallbaselineApp.py dask error for numWorker=40 #510

Open falkamelung opened 2 years ago

falkamelung commented 2 years ago

Using mintpy.numWorker=40 I get occasional errors. Here is one example (it can be obtained from jetstream). With mintpy.numWorker=36 it works fine.

Can we automatically use two thirds of the available cores? Then we don't have to specify a numWorker number.
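For illustration, the two-thirds rule could look roughly like the sketch below. This is only an assumption about how it might be done, not existing rsmas_insar or MintPy code: the helper name `two_thirds_of_cores` is hypothetical, and the core count comes from the Python standard library (`os.sched_getaffinity` where available, otherwise `os.cpu_count`).

```python
import os


def two_thirds_of_cores(total_cores=None):
    """Return two thirds of the available cores, rounded down, at least 1.

    Hypothetical helper for illustration; not part of rsmas_insar or MintPy.
    """
    if total_cores is None:
        if hasattr(os, "sched_getaffinity"):
            # Linux: count only the cores this process is allowed to run on,
            # which respects limits set by a batch scheduler.
            total_cores = len(os.sched_getaffinity(0))
        else:
            # Other platforms: fall back to the total core count.
            total_cores = os.cpu_count() or 1
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, (total_cores * 2) // 3)


if __name__ == "__main__":
    print(f"suggested numWorker: {two_thirds_of_cores()}")
```

The result could then be used in place of a hard-coded mintpy.numWorker value. Whether two thirds is low enough to avoid the error on every node type would still need testing.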

MakranChunk27SenDT49

cat smallbaseline_wrapper_8681675.e
ls: cannot access mintpy/timeseries*: No such file or directory
distributed.scheduler - ERROR - Couldn't gather keys {'ifgram_inversion_patch-4c90a9c6f26c6091623ad4786220ea89': ['tcp://127.0.0.1:43841'], 'ifgram_inversion_patch-0503ce9ee42e6a8ea5b6582a9f6427ad': ['tcp://127.0.0.1:45677'], 'ifgram_inversion_patch-c837056f42c79cdfa821d2b080b76cd6': ['tcp://127.0.0.1:42637'], 'ifgram_inversion_patch-c671ce973d15aa8b4a3004f540c9ae90': ['tcp://127.0.0.1:41419'], 'ifgram_inversion_patch-8939446c99162fb18a5dae48b2d80a7a': ['tcp://127.0.0.1:41829'], 'ifgram_inversion_patch-895cd48c7c721a02adee3ae89fa19a01': ['tcp://127.0.0.1:45006'], 'ifgram_inversion_patch-358952f4b0a73a5521ab2b63216e1094': ['tcp://127.0.0.1:44793'], 'ifgram_inversion_patch-96d976ac9080486bab319b165120ba40': ['tcp://127.0.0.1:44732'], 'ifgram_inversion_patch-bad9c16812020a55a419bb655a7f0a20': ['tcp://127.0.0.1:44304'], 'ifgram_inversion_patch-bfe6723701408d3422e0f28e69128743': ['tcp://127.0.0.1:46139'], 'ifgram_inversion_patch-00a4f90b07a0882bf43cd63c2d915166': ['tcp://127.0.0.1:41779'], 'ifgram_inversion_patch-93ca686e238553fca6764399cfe5e8b2': ['tcp://127.0.0.1:45638']} state: ['memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory'] workers: ['tcp://127.0.0.1:42637', 'tcp://127.0.0.1:45638', 'tcp://127.0.0.1:41419', 'tcp://127.0.0.1:45006', 'tcp://127.0.0.1:41779', 'tcp://127.0.0.1:45677', 'tcp://127.0.0.1:44793', 'tcp://127.0.0.1:41829', 'tcp://127.0.0.1:44304', 'tcp://127.0.0.1:43841', 'tcp://127.0.0.1:46139', 'tcp://127.0.0.1:44732']
NoneType: None
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2acbb081b310>>, <Task finished name='Task-689' coro=<as_completed._track_future() done, defined at /work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/client.py:4441> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:42494 after 30 s')>)
Traceback (most recent call last):
  File "/tmp/rsmas_insar/3rdparty/miniconda3/lib/python3.8/asyncio/tasks.py", line 465, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception: