Closed by cmarshak 1 year ago.
@mgovorcin - when you ran and generated your sample products on leffe for #142, did you manually set `OMP_NUM_THREADS`?
I have a feeling this is related to running out of memory on the instance. I am seeing similar issues running the current `dev` branch on leffe, and the bug is challenging to track down. I was also wondering if you knew the disk space required for the new updates.
If we were to set this environment variable, do you set the number of threads to 4 for the entire workflow or just for the ionosphere step? Thanks for your help.
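For reference, the variable can be scoped either way (a minimal sketch; the `echo` stands in for any command in the workflow):

```shell
# Option 1: export once; every command in the rest of the session inherits it.
export OMP_NUM_THREADS=4

# Option 2: prefix a single command; only that process sees the value.
OMP_NUM_THREADS=4 sh -c 'echo "$OMP_NUM_THREADS"'   # prints 4
```

Option 2 is handy on a shared server, since it limits threads for one run without changing the session for anything else.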
A related bug was found locally and resolved in a PR, but the original failure persisted even after that fix.
Running the command suggested in #142 (to create the sample product there), I get the following error:
```
...
Masking upper-iono interferograms IW1/04:  62%|██████▏   | 21/34 [00:20<00:10,  1.24it/s]
...
Masking upper-iono interferograms IW3/06: 100%|██████████| 34/34 [00:34<00:00,  1.02s/it]
  0%|          | 0/3612 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/leffe-data2/cmarshak/mambaforge/envs/topsapp_env/bin/isce2_topsapp", line 8, in <module>
    sys.exit(main())
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/__main__.py", line 350, in main
    sys.exit(process_entry_point.load()())
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/__main__.py", line 232, in gunw_slc
    nc_path = update_gunw_with_solid_earth_tide(nc_path, "reference", loc_data['reference_orbits'])
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 454, in update_gunw_with_solid_earth_tide
    solid_earth_tide_ds = compute_solid_earth_tide_from_gunw(gunw_path=gunw_path,
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 184, in compute_solid_earth_tide_from_gunw
    tide_e, tide_n, tide_u = compute_enu_solid_earth_tide(orbit_xmls=orbit_xmls,
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 352, in compute_enu_solid_earth_tide
    azimuth_time_arr = get_azimuth_time_array(orbit_xmls,
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 250, in get_azimuth_time_array
    datetime_isce, rng = orb.geo2rdr([lat, lon, hgt])
  File "/mnt/leffe-data2/cmarshak/mambaforge/envs/topsapp_env/lib/python3.9/site-packages/isce/components/isceobj/Orbit/Orbit.py", line 1048, in geo2rdr
    tguess = tguess - datetime.timedelta(seconds = fn/fnprime)
ValueError: cannot convert float NaN to integer
```
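For context on the final error: `datetime.timedelta` rejects NaN seconds, so a NaN Newton step (e.g. from reading data that was never written) surfaces as this confusing `ValueError`. A minimal sketch of a guard, assuming a hypothetical `newton_time_step` helper (this is not the actual isce2 code):

```python
import datetime
import math

def newton_time_step(tguess, fn, fnprime):
    """Hypothetical guard around a Newton iteration step like the one in
    Orbit.geo2rdr. If fn or fnprime is NaN, timedelta would raise
    'cannot convert float NaN to integer'; checking the step first
    surfaces a clearer error message."""
    step = fn / fnprime
    if math.isnan(step):
        raise ValueError(
            "geo2rdr Newton iteration produced a NaN step; "
            "check that the input data actually exists"
        )
    return tguess - datetime.timedelta(seconds=step)
```

With a guard like this, the missing-file root cause would be much easier to spot in the logs.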
On leffe, the S1-GUNW has not been generated yet, i.e. the netCDF data is not there. I believe this error is misleading: the workflow is trying to perform the solid earth tide (SET) correction and cannot, because this portion of the workflow is trying to open a file that does not exist.
@mgovorcin - when you have a chance, please re-try your workflow on this `dev` branch, reinstalling the environment and using the example you provided.
Another piece that could be affecting this is the `fsspec` library we had to add. Maybe that is causing issues with the packaging? I am not sure.
So to be clear:

I ran the container locally on our linux server and found a bug that was fixed by #146. Now, I reinstall the environment and rerun the workflow on a test using:

```shell
OMP_NUM_THREADS=4 isce2_topsapp --secondary-scenes S1A_IW_SLC__1SDV_20220131T222803_20220131T222830_041711_04F690_8F5F \
    S1A_IW_SLC__1SDV_20220131T222828_20220131T222855_041711_04F690_28D7 \
    --reference-scenes S1A_IW_SLC__1SDV_20220212T222803_20220212T222830_041886_04FCA3_2B3E \
    S1A_IW_SLC__1SDV_20220212T222828_20220212T222855_041886_04FCA3_A3E2 \
    --compute-solid-earth-tide True \
    --frame-id 25502 \
    --estimate-ionosphere-delay True \
    > topsapp_img.out 2> topsapp_img.err
```

and the process runs to completion.
When I run:

```shell
docker run -ti -v $PWD:/home/ops/topsapp_data --entrypoint /bin/bash ghcr.io/access-cloud-based-insar/dockerizedtopsapp:0.2.4.dev160_g5e9e433
conda activate topsapp_env
cd topsapp_data  # home directory
<same command as above with username/password>
```
it ran out of space on the device (which might be due to how docker is installed on our linux server). I am not able to reproduce the above bug failing within 2 seconds, as HyP3 indicates.
Furthermore, the test that I originally reported as failing can be run locally as:

```shell
isce2_topsapp --reference-scenes S1A_IW_SLC__1SDV_20220125T145707_20220125T145734_041619_04F365_A0A6 \
    S1A_IW_SLC__1SDV_20220125T145732_20220125T145759_041619_04F365_F27F \
    --secondary-scenes S1A_IW_SLC__1SDV_20220113T145708_20220113T145735_041444_04ED75_4324 \
    S1A_IW_SLC__1SDV_20220113T145732_20220113T145759_041444_04ED75_48DD \
    --compute-solid-earth-tide True \
    --frame-id 11107 \
    --estimate-ionosphere-delay True \
    > topsapp_img.out 2> topsapp_img.err
```
This too does not fail locally after a few seconds. I do set `OMP_NUM_THREADS=4`, as indicated above, as a courtesy to other users of the server.
To answer some of the questions I posed above:
@cmarshak, @AndrewPlayer3 on our team offered to try to reproduce the matplotlib error in one of our copies of HyP3 and investigate further today, FYI.
@cmarshak, @asjohnston-asf After running some test jobs in one of our instances using the topsapp test image, I was not able to reproduce this error.
I've been able to confirm this is due to the job running out of disk space. Jobs in the hyp3-tibet-jpl deployment are limited to 100 GB or 118 GB of disk space.
We weren't able to reproduce the issue in an ASF environment because we use a different EC2 instance type that gives jobs 237 GB of disk space.
We attempt every job up to three times. The initial attempt fails after running out of disk in ~2 hours. The subsequent two attempts fail immediately with the Matplotlib error because the disk is already full when they start; the retry attempt is starting faster than the docker engine can release the disk from the first attempt. The HyP3 API only reports the log and processing time from the final attempt, which obscured the underlying failure in the first attempt.
I'll forward a log file later today so JPL can review where in the workflow the job is hitting its disk limit.
I recommend reviewing the workflow to reduce its overall disk space requirement. Investigating whether intermediate files can be deleted after they've been used would be a good first step.
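A minimal sketch of that first step (the helper names are hypothetical, not part of the actual workflow):

```python
import shutil
from pathlib import Path

def delete_intermediate(path: str) -> None:
    """Remove an intermediate product (file or directory) once no
    downstream step needs it, freeing disk for the rest of the run."""
    p = Path(path)
    if p.is_dir():
        shutil.rmtree(p)
    elif p.exists():
        p.unlink()

def free_disk_gb(where: str = "/") -> float:
    """Report remaining disk space, e.g. to log headroom before the
    workflow's most disk-hungry steps."""
    return shutil.disk_usage(where).free / 1e9
```

Logging `free_disk_gb()` between workflow stages would also pinpoint exactly where a job hits the 100/118 GB limit.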
We can also provision more disk space for each job if necessary, but that will make the workflow more expensive. I can provide more detail on the tradeoffs if that's an option you'd like to explore.
Thank you so much @asjohnston-asf and @AndrewPlayer3.
I don't think it's a good idea to try to reduce disk space usage. One of ISCE2's weaknesses is its intermediate I/O, and managing this is really a big ask.
If we have been using the expanded disk space for ACCESS, let's do it for Tibet.
Questions:
Part of my confusion was my inability to answer these questions. Super appreciate your help!
Individual attempts are logged to AWS Cloudwatch. I typically look up the job by ID in the Step Functions console, then click the link to the underlying AWS Batch job from the list of Events. The details of the Batch job will include a list of attempts, each with a link to an AWS Cloudwatch log stream.
Disk availability is based on the EC2 instance types used by a particular deployment. The three GUNW deployments use a combination of `c6id.xlarge` and `c5d.xlarge` instance types, which come with 118 GB and 100 GB of disk space, respectively.
Two more follow-up questions:
The link you shared only says "EBS-only" for disk space.
If that is indeed the case (and we don't need more storage), I created this PR: https://github.com/ASFHyP3/hyp3/pull/1752
Otherwise, we should find an instance type with more storage, because it's not worth the development effort to manage disk space for this workflow.
Also, thank you for the clarification - I see the "Job Attempts" tab under the Batch job, with different log streams for each attempt. Super helpful.
I can confirm that jobs which failed on Tibet, when resubmitted using the ACCESS account, ran to completion (with a few transient exceptions). This suggests the 18 GB that Andrew cited above would likely fix the issue being reported here.
This was resolved using instances with more disk space.
Describe the bug: The recent set of merges to `dev` produce a matplotlib issue. All the jobs fail rather quickly, e.g.