ACCESS-Cloud-Based-InSAR / DockerizedTopsApp


Matplotlib Issue in Error Messages #145

Closed cmarshak closed 1 year ago

cmarshak commented 1 year ago

Describe the bug The recent set of merges to dev produces a matplotlib error.

Traceback (most recent call last):
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 522, in _get_config_or_cache_dir
    tmpdir = tempfile.mkdtemp(prefix="matplotlib-")
  File "/opt/conda/envs/topsapp_env/lib/python3.9/tempfile.py", line 352, in mkdtemp
    prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
  File "/opt/conda/envs/topsapp_env/lib/python3.9/tempfile.py", line 122, in _sanitize_params
    dir = gettempdir()
  File "/opt/conda/envs/topsapp_env/lib/python3.9/tempfile.py", line 291, in gettempdir
    tempdir = _get_default_tempdir()
  File "/opt/conda/envs/topsapp_env/lib/python3.9/tempfile.py", line 223, in _get_default_tempdir
    raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/ops']
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/conda/envs/topsapp_env/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/conda/envs/topsapp_env/lib/python3.9/runpy.py", line 147, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/opt/conda/envs/topsapp_env/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/isce2_topsapp/__init__.py", line 4, in <module>
    from isce2_topsapp.delivery_prep import prepare_for_delivery
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/isce2_topsapp/delivery_prep.py", line 9, in <module>
    from matplotlib import cm
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 965, in <module>
    dict.update(rcParams, _rc_params_in_file(matplotlib_fname()))
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 613, in matplotlib_fname
    for fname in gen_candidates():
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 610, in gen_candidates
    yield os.path.join(get_configdir(), 'matplotlibrc')
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 303, in wrapper
    ret = func(**kwargs)
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 556, in get_configdir
    return _get_config_or_cache_dir(_get_xdg_config_dir)
  File "/opt/conda/envs/topsapp_env/lib/python3.9/site-packages/matplotlib/__init__.py", line 524, in _get_config_or_cache_dir
    raise OSError(
OSError: Matplotlib requires access to a writable cache directory, but the default path (/home/ops/.config/matplotlib) is not a writable directory, and a temporary directory could not be created; set the MPLCONFIGDIR environment variable to a writable directory
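
For reference, this is the workaround the error message itself suggests: point MPLCONFIGDIR at a writable directory before matplotlib is imported. A minimal sketch, assuming some writable path exists (the path below is illustrative only):

import os

# Point matplotlib at a known-writable config/cache directory *before* importing it.
# The path is illustrative; any writable location works.
os.environ.setdefault("MPLCONFIGDIR", "/home/ops/topsapp_data/.mplconfig")
os.makedirs(os.environ["MPLCONFIGDIR"], exist_ok=True)

from matplotlib import cm  # import only after MPLCONFIGDIR is set

Note that this only treats the symptom; as established later in this thread, the underlying cause was that the disk was already full, so no directory was writable.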

All the jobs fail rather quickly, e.g.:

{'job_id': '9eaf520e-698e-46d3-b6a3-fb2b6206b75d',
  'job_type': 'INSAR_ISCE_TEST',
  'request_time': '2023-07-13T20:40:40+00:00',
  'status_code': 'FAILED',
  'user_id': 'cmarshak',
  'name': 'Saudi-Arabia_72_GMAO_0713',
  'job_parameters': {'compute_solid_earth_tide': True,
   'estimate_ionosphere_delay': True,
   'frame_id': 11107,
   'granules': ['S1A_IW_SLC__1SDV_20220125T145707_20220125T145734_041619_04F365_A0A6',
    'S1A_IW_SLC__1SDV_20220125T145732_20220125T145759_041619_04F365_F27F'],
   'secondary_granules': ['S1A_IW_SLC__1SDV_20220113T145708_20220113T145735_041444_04ED75_4324',
    'S1A_IW_SLC__1SDV_20220113T145732_20220113T145759_041444_04ED75_48DD'],
   'weather_model': 'GMAO'},
  'logs': ['https://hyp3-tibet-jpl-contentbucket-81rn23hp7ppf.s3.us-west-2.amazonaws.com/9eaf520e-698e-46d3-b6a3-fb2b6206b75d/9eaf520e-698e-46d3-b6a3-fb2b6206b75d.log'],
  'expiration_time': '2023-07-28T00:00:00+00:00',
  'processing_times': [1.247]}
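
For anyone trying to reproduce this, the failed job can be looked up programmatically. A sketch with hyp3_sdk, assuming its Job attributes mirror the fields in the JSON above (the API URL and credentials are placeholders):

from hyp3_sdk import HyP3

# Placeholder API URL and credentials; substitute the actual hyp3-tibet-jpl endpoint.
hyp3 = HyP3(api_url="https://hyp3-tibet-jpl.example.com", username="user", password="pass")

job = hyp3.get_job_by_id("9eaf520e-698e-46d3-b6a3-fb2b6206b75d")
print(job.status_code)  # "FAILED"
print(job.logs)         # log URL(s); note these cover only the final attempt
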
cmarshak commented 1 year ago

@mgovorcin - when you ran and generated your sample products on leffe for #142, did you manually set OMP_NUM_THREADS?

I have a feeling this is related to running out of memory on the instance. I am seeing similar issues running the current dev branch on leffe, and the bug is challenging to track down. I was also wondering if you knew the disk space required for the new updates.

https://github.com/ASFHyP3/hyp3/blob/3645e0ec8fe685724aaf013dbda3350b7f59317c/.github/workflows/deploy-enterprise.yml#L35-L78

If we were to set this environment variable, do you set the number of threads to 4 for the entire workflow or just for the ionosphere step? Thanks for your help.
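
For reference, a sketch of the two scopes in Python, assuming the variable only needs to be present in the environment before any OpenMP-backed library initializes (the single-step command is a hypothetical placeholder):

import os

# Option A: cap threads for the whole workflow by exporting the variable before
# anything that links against OpenMP starts (same effect as `export OMP_NUM_THREADS=4`
# in the shell that launches isce2_topsapp).
os.environ["OMP_NUM_THREADS"] = "4"

# Option B: cap threads for a single step only, by launching that step as a
# subprocess with a modified environment, e.g. (hypothetical placeholder command):
#   import subprocess
#   subprocess.run(["<iono-step>"], env={**os.environ, "OMP_NUM_THREADS": "4"}, check=True)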

cmarshak commented 1 year ago

A separate issue was found locally and resolved in a PR; the original bug reported here did not go away even after that fix.


Running the job suggested in #142 (to create the sample product there), I get the following error:

...
Masking upper-iono interferograms IW1/04:  62%|██████▏   | 21/34 [00:20<00:10,  1.24it/s]
Masking upper-iono interferograms IW1/05:  62%|██████▏   | 21/34 [00:20<00:10,  1.24it/s]
Masking upper-iono interferograms IW1/05:  65%|██████▍   | 22/34 [00:21<00:09,  1.29it/s]
Masking upper-iono interferograms IW2/01:  65%|██████▍   | 22/34 [00:21<00:09,  1.29it/s]
Masking upper-iono interferograms IW2/01:  68%|██████▊   | 23/34 [00:21<00:08,  1.26it/s]
Masking upper-iono interferograms IW2/02:  68%|██████▊   | 23/34 [00:21<00:08,  1.26it/s]
Masking upper-iono interferograms IW2/02:  71%|███████   | 24/34 [00:23<00:08,  1.12it/s]
Masking upper-iono interferograms IW2/03:  71%|███████   | 24/34 [00:23<00:08,  1.12it/s]
Masking upper-iono interferograms IW2/03:  74%|███████▎  | 25/34 [00:24<00:08,  1.05it/s]
Masking upper-iono interferograms IW2/04:  74%|███████▎  | 25/34 [00:24<00:08,  1.05it/s]
Masking upper-iono interferograms IW2/04:  76%|███████▋  | 26/34 [00:25<00:08,  1.05s/it]
Masking upper-iono interferograms IW2/05:  76%|███████▋  | 26/34 [00:25<00:08,  1.05s/it]
Masking upper-iono interferograms IW2/05:  79%|███████▉  | 27/34 [00:26<00:07,  1.10s/it]
Masking upper-iono interferograms IW2/06:  79%|███████▉  | 27/34 [00:26<00:07,  1.10s/it]
Masking upper-iono interferograms IW2/06:  82%|████████▏ | 28/34 [00:27<00:06,  1.09s/it]
Masking upper-iono interferograms IW3/01:  82%|████████▏ | 28/34 [00:27<00:06,  1.09s/it]
Masking upper-iono interferograms IW3/01:  85%|████████▌ | 29/34 [00:28<00:05,  1.03s/it]
Masking upper-iono interferograms IW3/02:  85%|████████▌ | 29/34 [00:28<00:05,  1.03s/it]
Masking upper-iono interferograms IW3/02:  88%|████████▊ | 30/34 [00:29<00:04,  1.06s/it]
Masking upper-iono interferograms IW3/03:  88%|████████▊ | 30/34 [00:29<00:04,  1.06s/it]
Masking upper-iono interferograms IW3/03:  91%|█████████ | 31/34 [00:30<00:03,  1.09s/it]
Masking upper-iono interferograms IW3/04:  91%|█████████ | 31/34 [00:30<00:03,  1.09s/it]
Masking upper-iono interferograms IW3/04:  94%|█████████▍| 32/34 [00:31<00:02,  1.06s/it]
Masking upper-iono interferograms IW3/05:  94%|█████████▍| 32/34 [00:31<00:02,  1.06s/it]
Masking upper-iono interferograms IW3/05:  97%|█████████▋| 33/34 [00:33<00:01,  1.09s/it]
Masking upper-iono interferograms IW3/06:  97%|█████████▋| 33/34 [00:33<00:01,  1.09s/it]
Masking upper-iono interferograms IW3/06: 100%|██████████| 34/34 [00:34<00:00,  1.20s/it]
Masking upper-iono interferograms IW3/06: 100%|██████████| 34/34 [00:34<00:00,  1.02s/it]

  0%|          | 0/3612 [00:00<?, ?it/s]
  0%|          | 0/3612 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/leffe-data2/cmarshak/mambaforge/envs/topsapp_env/bin/isce2_topsapp", line 8, in <module>
    sys.exit(main())
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/__main__.py", line 350, in main
    sys.exit(process_entry_point.load()())
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/__main__.py", line 232, in gunw_slc
    nc_path = update_gunw_with_solid_earth_tide(nc_path, "reference", loc_data['reference_orbits'])
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 454, in update_gunw_with_solid_earth_tide
    solid_earth_tide_ds = compute_solid_earth_tide_from_gunw(gunw_path=gunw_path,
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 184, in compute_solid_earth_tide_from_gunw
    tide_e, tide_n, tide_u = compute_enu_solid_earth_tide(orbit_xmls=orbit_xmls,
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 352, in compute_enu_solid_earth_tide
    azimuth_time_arr = get_azimuth_time_array(orbit_xmls,
  File "/mnt/leffe-data2/cmarshak/DockerizedTopsApp/isce2_topsapp/solid_earth_tides.py", line 250, in get_azimuth_time_array
    datetime_isce, rng = orb.geo2rdr([lat, lon, hgt])
  File "/mnt/leffe-data2/cmarshak/mambaforge/envs/topsapp_env/lib/python3.9/site-packages/isce/components/isceobj/Orbit/Orbit.py", line 1048, in geo2rdr
    tguess = tguess - datetime.timedelta(seconds = fn/fnprime)
ValueError: cannot convert float NaN to integer

On leffe, the S1-GUNW has not been generated yet, i.e. the netCDF file is not there. I believe this error is misleading: the workflow is trying to perform the SET correction and fails because this portion of the workflow is opening a file that does not exist.
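
If that is the case, a guard of roughly this shape, run before update_gunw_with_solid_earth_tide is called, would surface the real failure instead of the NaN error (a sketch only; the helper is hypothetical):

from pathlib import Path

def check_gunw_exists(nc_path):
    """Hypothetical guard: fail loudly if the S1-GUNW netCDF was never written."""
    nc_path = Path(nc_path)
    if not nc_path.is_file() or nc_path.stat().st_size == 0:
        raise FileNotFoundError(
            f"S1-GUNW not found (or empty) at {nc_path}; "
            "skipping the SET update because the upstream packaging step did not produce it."
        )
    return nc_path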

@mgovorcin - when you have a chance, please re-try your workflow on this dev branch, using the example you provided, after reinstalling the environment.

Another piece that could be affecting this is the fsspec library we had to add. Maybe that is causing issues with the packaging? I am not sure.

cmarshak commented 1 year ago

So to be clear:

I ran the container locally on our Linux server and found a bug that was fixed by #146. I then reinstalled the environment and reran the workflow on a test using:

export OMP_NUM_THREADS=4
isce2_topsapp --secondary-scenes S1A_IW_SLC__1SDV_20220131T222803_20220131T222830_041711_04F690_8F5F \
                                 S1A_IW_SLC__1SDV_20220131T222828_20220131T222855_041711_04F690_28D7  \
              --reference-scenes S1A_IW_SLC__1SDV_20220212T222803_20220212T222830_041886_04FCA3_2B3E \
                                 S1A_IW_SLC__1SDV_20220212T222828_20220212T222855_041886_04FCA3_A3E2  \
              --compute-solid-earth-tide True \
              --frame-id 25502 \
              --estimate-ionosphere-delay True \
              > topsapp_img.out 2> topsapp_img.err

and the process runs to completion.

When I run:

docker run -ti -v $PWD:/home/ops/topsapp_data --entrypoint /bin/bash ghcr.io/access-cloud-based-insar/dockerizedtopsapp:0.2.4.dev160_g5e9e433
conda activate topsapp_env
cd topsapp_data # home directory
<same command as above with username/password>

this ran out of space on the device (which might be due to how Docker is installed on our Linux server). I am not able to reproduce the failure above in ~2 seconds, as HyP3 is indicating.
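
As a sanity check before launching inside the container, something like the following confirms how much space the job actually sees (a sketch; the path is the bind mount from the docker run command above):

import shutil

# "/home/ops/topsapp_data" is the bind mount used above; adjust as needed.
total, used, free = shutil.disk_usage("/home/ops/topsapp_data")
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")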

cmarshak commented 1 year ago

Furthermore, the job that I reported as failing initially can be run locally as:

isce2_topsapp --reference-scenes S1A_IW_SLC__1SDV_20220125T145707_20220125T145734_041619_04F365_A0A6 \
                                 S1A_IW_SLC__1SDV_20220125T145732_20220125T145759_041619_04F365_F27F  \
              --secondary-scenes S1A_IW_SLC__1SDV_20220113T145708_20220113T145735_041444_04ED75_4324 \
                                 S1A_IW_SLC__1SDV_20220113T145732_20220113T145759_041444_04ED75_48DD  \
              --compute-solid-earth-tide True \
              --frame-id 11107 \
              --estimate-ionosphere-delay True \
              > topsapp_img.out 2> topsapp_img.err

This, too, does not fail locally after a few seconds. I do set OMP_NUM_THREADS=4, as indicated above, as a courtesy to other users of the server.

cmarshak commented 1 year ago

To answer some of the questions I posed above:

  1. The disk space required for many of the frame-based jobs is < 100 GB (standard non-frame-based GUNWs typically required more, ~130 GB). I believe the overhead is lower because the geocoding step crops the final 30 m products generated by ISCE2 to the frame extent, which is smaller than a single S1 frame (a rough way to measure the footprint is sketched after this list).
  2. I do not know how much memory is required to run unwrapping; it would be surprising if memory were the culprit, since we had run ionosphere corrections previously and this did not cause any problems.
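
A quick way to measure that footprint while a job runs is to total the size of the ISCE2 working directory. A sketch, assuming it is run from (or pointed at) that directory:

from pathlib import Path

def dir_size_gb(path="."):
    """Total size of all files under `path`, in GB (illustrative helper)."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e9

print(f"working directory footprint: {dir_size_gb('.'):.1f} GB")
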
asjohnston-asf commented 1 year ago

@cmarshak , @AndrewPlayer3 on our team offered to try to reproduce the matplotlib error in one of our copies of HyP3 and investigate further today, FYI

AndrewPlayer3 commented 1 year ago

@cmarshak, @asjohnston-asf After running some test jobs in one of our instances using the topsapp test image, I was not able to reproduce this error.

asjohnston-asf commented 1 year ago

I've been able to confirm this is due to the job running out of disk space. Jobs in the hyp3-tibet-jpl deployment are limited to 100 GB or 118 GB of disk space.

We weren't able to reproduce the issue in an ASF environment because we use a different EC2 instance type that gives jobs 237 GB of disk space.

We attempt every job up to three times. The initial attempt fails after running out of disk in ~2 hours. The subsequent two attempts fail immediately with the Matplotlib error because the disk is already full when they start; the retry attempt is starting faster than the docker engine can release the disk from the first attempt. The HyP3 API only reports the log and processing time from the final attempt, which obscured the underlying failure in the first attempt.

I'll forward a log file later today so JPL can review where in the workflow the job is hitting its disk limit.

I recommend reviewing the workflow to reduce its overall disk space requirement. Investigating whether there's an opportunity to delete intermediate files after they've been used would be a good first step (see the sketch below).
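
For illustration, a cleanup step of roughly this shape could run between stages, removing intermediates once the downstream product they feed exists (a sketch only; every directory and file name below is a placeholder, not the workflow's actual layout):

import shutil
from pathlib import Path

def remove_intermediates(work_dir, done_marker, intermediate_dirs):
    """Delete intermediate directories once `done_marker` (a downstream product) exists.

    All names here are placeholders; the real workflow's directory layout may differ.
    """
    work_dir = Path(work_dir)
    if not (work_dir / done_marker).is_file():
        return  # downstream product not written yet; keep everything
    for name in intermediate_dirs:
        target = work_dir / name
        if target.is_dir():
            shutil.rmtree(target)

# Example usage (placeholder names):
# remove_intermediates(".", "merged/filt_topophase.unw.geo", ["fine_coreg", "fine_offsets"])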

We can also provision more disk space for each job if necessary, but that will make the workflow more expensive. I can provide more detail on the tradeoffs if that's an option you'd like to explore.

cmarshak commented 1 year ago

Thank you so much @asjohnston-asf and @AndrewPlayer3.

I don't think it's a good idea to try to shrink the workflow's disk footprint. One of ISCE2's weaknesses is its intermediate I/O, and managing that is really a big ask.

If we have been using the expanded disk space for the ACCESS deployment, let's do the same for Tibet.

Questions:

  1. How did you get the first attempt's log via AWS? Or did retrieving the log require resubmitting the job and monitoring it carefully?
  2. Can you control (or inspect) the disk availability of instances via the HyP3 YAML files? If so, could you please link to them?

Part of my confusion was my inability to answer these questions. Super appreciate your help!

asjohnston-asf commented 1 year ago

  1. Individual attempts are logged to AWS CloudWatch. I typically look up the job by ID in the Step Functions console, then click the link to the underlying AWS Batch job from the list of Events. The details of the Batch job will include a list of attempts, each with a link to an AWS CloudWatch log stream.

     - AWS Batch only keeps job history for about a week, so the logs get tougher to track down after that.

  2. Disk availability is based on the EC2 instance types used by a particular deployment. The three GUNW deployments use a combination of c6id.xlarge and c5d.xlarge instance types, which come with 118 GB and 100 GB of disk space, respectively (a sketch for reading these figures from the EC2 API follows this list).

     - https://github.com/ASFHyP3/hyp3/blob/develop/.github/workflows/deploy-enterprise.yml#L35
     - https://aws.amazon.com/ec2/instance-types/ -> Compute Optimized -> C6i/C5 tabs
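
For reference, the local instance-store size per instance type can also be read from the EC2 API, which is handy since the instance-types page does not always list it. A sketch with boto3 (the region is an assumption):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region is an assumption
resp = ec2.describe_instance_types(InstanceTypes=["c6id.xlarge", "c5d.xlarge"])
for item in resp["InstanceTypes"]:
    # Instance types without local NVMe storage have no InstanceStorageInfo key.
    storage = item.get("InstanceStorageInfo", {}).get("TotalSizeInGB", "EBS-only")
    print(item["InstanceType"], storage)
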
cmarshak commented 1 year ago

Two more follow up questions:

  1. Can we confirm that the failed jobs are going to the c5d.xlarge? I am not sure how to inspect that.
  2. How did you determine the instance storage available?

The link you shared only says EBS-only for disk space.

cmarshak commented 1 year ago

If that is indeed the case (and we don't need more storage), I have created this PR: https://github.com/ASFHyP3/hyp3/pull/1752

Otherwise, we should find an instance type with more storage, because it's not worth the development effort to manage disk space for this workflow.

cmarshak commented 1 year ago

Also, thank you for the clarification: I see the "Job Attempts" tab under the Batch job, with a different log stream for each attempt. Super helpful.

cmarshak commented 1 year ago

I can confirm that jobs that failed on Tibet ran to completion when resubmitted using the ACCESS account (with a few transient exceptions). This means the 18 GB that Andrew cited above would likely fix the issue being reported here.

cmarshak commented 1 year ago

This was resolved using instances with more disk space.