geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

Dask not working with current python environment #280

Closed falkamelung closed 3 years ago

falkamelung commented 4 years ago

I installed the python environments (new and old) in your area (test/test2 and test/testold). Some description of the issue is also at https://github.com/insarlab/MintPy/issues/165 . Since then, somebody confirmed that it does not work under PBS either. David's efforts are documented at https://github.com/2gotgrossman/dask-rsmas-presentation . As I said, I spent a lot of time trying to install the old environment using the old requirements files but did not get it to work.

(After opening the issue it occurred to me that this is not an rsmas_insar issue but a MintPy issue. The MintPy environment is simpler as it does not have ISCE. I did install a MintPy python environment (run s.bmintpy). It gives the same problem. @yunjunz is also interested in this.)

First, run the test data with the old (good) python environment (linked in the 3rdparty dir using ln -s /projects/scratch/insarlab/famelung/MINICONDA3_GOOD miniconda3):

s.btestold
cd /projects/scratch/insarlab/jaz101/unittestGalapagosSenDT128/mintpy
rm -rf wor* time* S1* *velo* *lock
ifgram_inversion.py /projects/scratch/insarlab/famelung/unittestGalapagosSenDT128/mintpy/inputs/ifgramStack.h5 -t /projects/scratch/insarlab/famelung/unittestGalapagosSenDT128/mintpy/smallbaselineApp.cfg --update

You will see the following output on the screen. Once you see the line `FUTURE #1...`, the first worker has completed its job.

/nethome/jaz101/test/testold/rsmas_insar/3rdparty/miniconda3/lib/python3.7/site-packages/distributed/deploy/local.py:106: UserWarning: diagnostics_port has been deprecated. Please use `dashboard_address=` instead
  "diagnostics_port has been deprecated. "
JOB COMMAND CALLED FROM PYTHON: #!/bin/bash

#BSUB -J mintpy_bee
#BSUB -q general
#BSUB -P insarlab
#BSUB -n 2
#BSUB -R "span[hosts=1]"
#BSUB -M 4000
#BSUB -W 00:15
#BSUB -R "rusage[mem=2500]"
#BSUB -o worker_mintpy.%J.o
#BSUB -e worker_mintpy.%J.e
JOB_ID=${LSB_JOBID%.*}

/nethome/jaz101/test/testold/rsmas_insar/3rdparty/miniconda3/bin/python3 -m distributed.cli.dask_worker tcp://10.11.1.13:43577 --nthreads 2 --memory-limit 4.00GB --name mintpy_bee--${JOB_ID}-- --death-timeout 60 --interface ib0

0 [0, 0, 34, 1100]
1 [34, 0, 68, 1100]
2 [68, 0, 102, 1100]
3 [102, 0, 136, 1100]
4 [136, 0, 170, 1100]
5 [170, 0, 204, 1100]
6 [204, 0, 238, 1100]
7 [238, 0, 272, 1100]
8 [272, 0, 306, 1100]
9 [306, 0, 341, 1100]
10 [341, 0, 375, 1100]
11 [375, 0, 409, 1100]
12 [409, 0, 443, 1100]
13 [443, 0, 477, 1100]
14 [477, 0, 511, 1100]
15 [511, 0, 545, 1100]
16 [545, 0, 579, 1100]
17 [579, 0, 613, 1100]
18 [613, 0, 647, 1100]
19 [647, 0, 682, 1100]
20 [682, 0, 716, 1100]
21 [716, 0, 750, 1100]
22 [750, 0, 784, 1100]
23 [784, 0, 818, 1100]
24 [818, 0, 852, 1100]
25 [852, 0, 886, 1100]
26 [886, 0, 920, 1100]
27 [920, 0, 954, 1100]
28 [954, 0, 988, 1100]
29 [988, 0, 1023, 1100]
30 [1023, 0, 1057, 1100]
31 [1057, 0, 1091, 1100]
32 [1091, 0, 1125, 1100]
33 [1125, 0, 1159, 1100]
34 [1159, 0, 1193, 1100]
35 [1193, 0, 1227, 1100]
36 [1227, 0, 1261, 1100]
37 [1261, 0, 1295, 1100]
38 [1295, 0, 1329, 1100]
39 [1329, 0, 1364, 1100]
FUTURE #1 complete in 22.41086196899414 seconds. Box: [1329, 0, 1364, 1100] Time: 1576907836.4373446
FUTURE #2 complete in 22.615032196044922 seconds. Box: [341, 0, 375, 1100] Time: 1576907836.641515
FUTURE #3 complete in 23.67570185661316 seconds. Box: [204, 0, 238, 1100] Time: 1576907837.7021844
FUTURE #4 complete in 23.899144649505615 seconds. Box: [750, 0, 784, 1100] Time: 1576907837.9256275
FUTURE #5 complete in 33.91184902191162 seconds. Box: [1193, 0, 1227, 1100] Time: 1576907847.9383318
FUTURE #6 complete in 34.35519361495972 seconds. Box: [409, 0, 443, 1100] Time: 1576907848.3816762
FUTURE #7 complete in 34.4250373840332 seconds. Box: [1295, 0, 1329, 1100] Time: 1576907848.45152
FUTURE #8 complete in 34.43126916885376 seconds. Box: [1091, 0, 1125, 1100] Time: 1576907848.4577518
FUTURE #9 complete in 34.47014904022217 seconds. Box: [102, 0, 136, 1100] Time: 1576907848.4966319
FUTURE #10 complete in 34.49504494667053 seconds. Box: [1261, 0, 1295, 1100] Time: 1576907848.5215275
FUTURE #11 complete in 34.52674746513367 seconds. Box: [716, 0, 750, 1100] Time: 1576907848.55323
FUTURE #12 complete in 34.56773853302002 seconds. Box: [613, 0, 647, 1100] Time: 1576907848.5942214
FUTURE #13 complete in 34.633689165115356 seconds. Box: [579, 0, 613, 1100] Time: 1576907848.6601717
FUTURE #14 complete in 34.643046855926514 seconds. Box: [954, 0, 988, 1100] Time: 1576907848.6695294
FUTURE #15 complete in 34.737093687057495 seconds. Box: [1125, 0, 1159, 1100] Time: 1576907848.7635763
FUTURE #16 complete in 34.85588765144348 seconds. Box: [784, 0, 818, 1100] Time: 1576907848.8823705
FUTURE #17 complete in 34.89444017410278 seconds. Box: [1023, 0, 1057, 1100] Time: 1576907848.920923
FUTURE #18 complete in 34.98618984222412 seconds. Box: [272, 0, 306, 1100] Time: 1576907849.0126724
FUTURE #19 complete in 34.988025426864624 seconds. Box: [0, 0, 34, 1100] Time: 1576907849.0145073
FUTURE #20 complete in 35.06926655769348 seconds. Box: [852, 0, 886, 1100] Time: 1576907849.0957494
FUTURE #21 complete in 35.12981843948364 seconds. Box: [920, 0, 954, 1100] Time: 1576907849.1563005
FUTURE #22 complete in 35.1398344039917 seconds. Box: [511, 0, 545, 1100] Time: 1576907849.1663165
FUTURE #23 complete in 35.14792537689209 seconds. Box: [545, 0, 579, 1100] Time: 1576907849.1744082
FUTURE #24 complete in 35.26082181930542 seconds. Box: [1057, 0, 1091, 1100] Time: 1576907849.2873044
FUTURE #25 complete in 35.325475454330444 seconds. Box: [818, 0, 852, 1100] Time: 1576907849.3519585
FUTURE #26 complete in 35.357988357543945 seconds. Box: [682, 0, 716, 1100] Time: 1576907849.384471
FUTURE #27 complete in 35.36216115951538 seconds. Box: [443, 0, 477, 1100] Time: 1576907849.388643
FUTURE #28 complete in 35.36387801170349 seconds. Box: [477, 0, 511, 1100] Time: 1576907849.3903596
FUTURE #29 complete in 35.44611406326294 seconds. Box: [1329, 0, 1364, 1100] Time: 1576907849.4725966
FUTURE #30 complete in 35.512518644332886 seconds. Box: [988, 0, 1023, 1100] Time: 1576907849.5390012
FUTURE #31 complete in 35.62356638908386 seconds. Box: [1159, 0, 1193, 1100] Time: 1576907849.6500492
FUTURE #32 complete in 35.67281889915466 seconds. Box: [647, 0, 682, 1100] Time: 1576907849.6993017
FUTURE #33 complete in 35.694395303726196 seconds. Box: [375, 0, 409, 1100] Time: 1576907849.7208781
FUTURE #34 complete in 35.860108613967896 seconds. Box: [306, 0, 341, 1100] Time: 1576907849.8865912
FUTURE #35 complete in 35.878817319869995 seconds. Box: [886, 0, 920, 1100] Time: 1576907849.9052997
FUTURE #36 complete in 35.90852355957031 seconds. Box: [1227, 0, 1261, 1100] Time: 1576907849.9350061
FUTURE #37 complete in 36.021509885787964 seconds. Box: [34, 0, 68, 1100] Time: 1576907850.0479925
FUTURE #38 complete in 36.410053968429565 seconds. Box: [170, 0, 204, 1100] Time: 1576907850.4365366
FUTURE #39 complete in 36.676669120788574 seconds. Box: [68, 0, 102, 1100] Time: 1576907850.703151
FUTURE #40 complete in 36.9715633392334 seconds. Box: [136, 0, 170, 1100] Time: 1576907850.9980462
--------------------------------------------------
converting phase to range
calculating perpendicular baseline timeseries
...
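
A quick way to tell a healthy run from a hung one is to count the `FUTURE #N complete` lines against the number of boxes printed before them. The following is a minimal sketch, assuming the screen output has been saved as text; the function name `summarize_log` and the returned keys are hypothetical and not part of MintPy.

```python
import re

# Line formats as they appear in the screen output above.
BOX_RE = re.compile(r"^\d+ \[\d+, \d+, \d+, \d+\]$")
FUTURE_RE = re.compile(r"^FUTURE #\d+ complete in ([\d.]+) seconds\.")


def summarize_log(log_text):
    """Count submitted boxes and completed futures in captured screen output."""
    n_boxes = 0
    durations = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if BOX_RE.match(line):
            n_boxes += 1
            continue
        match = FUTURE_RE.match(line)
        if match:
            durations.append(float(match.group(1)))
    return {
        "boxes": n_boxes,
        "completed": len(durations),
        "pending": n_boxes - len(durations),
        # None means no worker ever reported back.
        "slowest_s": max(durations) if durations else None,
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_log.py < screen_output.txt
    print(summarize_log(sys.stdin.read()))
```

A run in which `completed` stays at 0 while `boxes` is non-zero matches the hang described in this issue.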

To run with the current (new) python environment (installed in the 3rdparty dir as described in https://github.com/geodesymiami/rsmas_insar/blob/master/docs/installation.md#installation-guide ), first clear your old environment, then run

s.bnew

and the same commands as above. You will see the screen output below, but the FUTURE #1 line will never show up. If you run bjobs you will see that the workers have been started but they don't do any work. They stop after the time-out period of 30 minutes.

/nethome/jaz101/test/test2/rsmas_insar/3rdparty/miniconda3/bin/python3 -m distributed.cli.dask_worker tcp://10.11.1.13:44169 --nthreads 2 --memory-limit 4.00GB --name mintpy_bee--${JOB_ID}-- --death-timeout 60 --interface ib0

0 [0, 0, 34, 1100]
1 [34, 0, 68, 1100]
2 [68, 0, 102, 1100]
3 [102, 0, 136, 1100]
4 [136, 0, 170, 1100]
5 [170, 0, 204, 1100]
6 [204, 0, 238, 1100]
7 [238, 0, 272, 1100]
8 [272, 0, 306, 1100]
9 [306, 0, 341, 1100]
10 [341, 0, 375, 1100]
11 [375, 0, 409, 1100]
12 [409, 0, 443, 1100]
13 [443, 0, 477, 1100]
14 [477, 0, 511, 1100]
15 [511, 0, 545, 1100]
16 [545, 0, 579, 1100]
17 [579, 0, 613, 1100]
18 [613, 0, 647, 1100]
19 [647, 0, 682, 1100]
20 [682, 0, 716, 1100]
21 [716, 0, 750, 1100]
22 [750, 0, 784, 1100]
23 [784, 0, 818, 1100]
24 [818, 0, 852, 1100]
25 [852, 0, 886, 1100]
26 [886, 0, 920, 1100]
27 [920, 0, 954, 1100]
28 [954, 0, 988, 1100]
29 [988, 0, 1023, 1100]
30 [1023, 0, 1057, 1100]
31 [1057, 0, 1091, 1100]
32 [1091, 0, 1125, 1100]
33 [1125, 0, 1159, 1100]
34 [1159, 0, 1193, 1100]
35 [1193, 0, 1227, 1100]
36 [1227, 0, 1261, 1100]
37 [1261, 0, 1295, 1100]
38 [1295, 0, 1329, 1100]
39 [1329, 0, 1364, 1100]

^Z
[1]+  Stopped                 ifgram_inversion.py /projects/scratch/insarlab/famelung/unittestGalapagosSenDT128/mintpy/inputs/ifgramStack.h5 -t /projects/scratch/insarlab/famelung/unittestGalapagosSenDT128/mintpy/smallbaselineApp.cfg --update
//login3/projects/scratch/insarlab/jaz101/unittestGalapagosSenDT128/mintpy[1004] bjobs
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
23155989  jaz101  RUN   general    login3      2*n264      mintpy_bee Dec 21 01:16
23155991  jaz101  RUN   general    login3      2*n276      mintpy_bee Dec 21 01:16
23155990  jaz101  RUN   general    login3      2*n259      mintpy_bee Dec 21 01:16
23155994  jaz101  RUN   general    login3      2*n267      mintpy_bee Dec 21 01:16

Ovec8hkin commented 4 years ago

What's the actual error you get from Dask?

falkamelung commented 4 years ago

It does not give any error. The jobs start but don’t run. The jobs stop when the walltime is over.
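
Nothing shows up on screen, but the submission script above directs worker output to `worker_mintpy.%J.o` and `worker_mintpy.%J.e`, so any worker-side message would presumably end up in those files. A minimal sketch for collecting the last lines of each file, assuming they are written to the directory the command was run from; the helper name `tail_worker_logs` is hypothetical.

```python
from pathlib import Path


def tail_worker_logs(directory=".", pattern="worker_mintpy.*.[oe]", n_lines=20):
    """Return the last n_lines of every worker stdout/stderr file in directory."""
    tails = {}
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" avoids failing on stray non-UTF-8 bytes in a log.
        lines = path.read_text(errors="replace").splitlines()
        tails[path.name] = lines[-n_lines:]
    return tails


if __name__ == "__main__":
    for name, lines in tail_worker_logs().items():
        print(f"==> {name} <==")
        print("\n".join(lines) if lines else "(empty)")
```

Empty `.e` files for jobs that bjobs reports as RUN would suggest the workers start but never connect to the scheduler, rather than crash.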

falkamelung commented 4 years ago

The way to debug this may be to run one of the examples that Dask comes with and see whether it also fails with our environment.
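
A first check along these lines could be as small as the sketch below. It is not one of the examples shipped with Dask, just a trivial task submitted through `dask.distributed` to an in-process `LocalCluster`, so neither LSF nor the `ib0` interface is involved. The function names and the 60-second timeout are assumptions chosen for illustration.

```python
from dask.distributed import Client, LocalCluster, wait


def square(x):
    """Trivial task: if this never returns, the problem is not in MintPy."""
    return x * x


def run_sanity_check(n_tasks=4, timeout_s=60):
    """Run n_tasks trivial futures on a local cluster and return their results."""
    # processes=False keeps the workers as threads inside this process;
    # dashboard_address=None disables the diagnostics dashboard.
    with LocalCluster(n_workers=2, threads_per_worker=1,
                      processes=False, dashboard_address=None) as cluster:
        with Client(cluster) as client:
            futures = client.map(square, range(n_tasks))
            # Raise a timeout error instead of hanging if no worker picks up the tasks.
            wait(futures, timeout=timeout_s)
            return client.gather(futures)


if __name__ == "__main__":
    print(run_sanity_check())
```

If this completes in both the old and the new environment, the next step would be to repeat the same `client.map` call against workers submitted as LSF jobs, which narrows the problem to the job-queue or network side.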

falkamelung commented 3 years ago

Resolved a while ago.