ai2cm / fv3config

Manipulate FV3GFS run directories
Apache License 2.0

run_docker now broken in latest GCR image #50

Open nbren12 opened 4 years ago

nbren12 commented 4 years ago

After some changes @oliverwm1 pushed today, the GCR image is now broken. Here is a minimal example that reproduces the failure:

python -m fv3config.fv3run  gs://vcm-ml-data/2020-01-15-noahb-exploration/2hr_strong_dampingone_step_config/C48/20160805.000000/fv3config.yml rundir  --dockerimage us.gcr.io/vcm-ml/fv3gfs-python
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/fv3gfs-python/external/fv3config/fv3config/fv3run/__main__.py", line 85, in <module>
    sys.exit(main())
  File "/fv3gfs-python/external/fv3config/fv3config/fv3run/__main__.py", line 51, in main
    run(args.config, args.outdir, args.runfile, args.dockerimage, args.keyfile)
  File "/fv3gfs-python/external/fv3config/fv3config/fv3run/__main__.py", line 81, in run
    run_native(config_dict_or_location, outdir, runfile=runfile)
  File "/fv3gfs-python/external/fv3config/fv3config/fv3run/_native.py", line 42, in run_native
    write_run_directory(config_dict, localdir)
  File "/fv3gfs-python/external/fv3config/fv3config/config/rundir.py", line 17, in write_run_directory
    write_assets_to_directory(config, target_directory)
  File "/fv3gfs-python/external/fv3config/fv3config/_asset_list.py", line 186, in write_assets_to_directory
    write_asset_list(asset_list, target_directory)
  File "/fv3gfs-python/external/fv3config/fv3config/_asset_list.py", line 192, in write_asset_list
    write_asset(asset, target_directory)
  File "/fv3gfs-python/external/fv3config/fv3config/_asset_list.py", line 175, in write_asset
    filesystem.get_file(source_path, target_path, cache=True)
  File "/fv3gfs-python/external/fv3config/fv3config/filesystem.py", line 88, in get_file
    _get_file_cached(source_filename, dest_filename)
  File "/fv3gfs-python/external/fv3config/fv3config/filesystem.py", line 102, in _get_file_cached
    os.makedirs(os.path.dirname(cache_location), exist_ok=True)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 3 more times]
  File "/usr/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/inputdata/fv3config-cache/gs'
Traceback (most recent call last):
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/site-packages/fv3config/fv3run/__main__.py", line 85, in <module>
    sys.exit(main())
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/site-packages/fv3config/fv3run/__main__.py", line 51, in main
    run(args.config, args.outdir, args.runfile, args.dockerimage, args.keyfile)
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/site-packages/fv3config/fv3run/__main__.py", line 78, in run
    keyfile=keyfile,
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/site-packages/fv3config/fv3run/_docker.py", line 57, in run_docker
    + python_args
  File "/home/noahb/miniconda3/envs/fv3net/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'run', '-v', '/home/noahb/workspace/fv3-python/examples/rundir:/outdir', '-v', '/home/noahb/keys/noahb-vm.json:/gcs_key.json', '--rm', '--user', '1000:1000', '-e', 'GOOGLE_APPLICATION_CREDENTIALS=/gcs_key.json', 'us.gcr.io/vcm-ml/fv3gfs-python', 'python3', '-m', 'fv3config.fv3run', 'gs://vcm-ml-data/2020-01-15-noahb-exploration/2hr_strong_dampingone_step_config/C48/20160805.000000/fv3config.yml', '/outdir']' returned non-zero exit status 1.

@mcgibbon explained the problem on Slack as follows:

The issue there is that fv3config now writes to the cache directory, but your user doesn’t have permission to write there. The fix would be to use some kind of directory like /tmp that anyone can write under. So this one’s a regression bug I introduced with the caching, but I also never updated fv3gfs-python to use the newer fv3config, so I’m a little confused as to how that image got set up that way. Might be Oli pushed an image today. The core issue behind all this is we should probably version our images.

mcgibbon commented 4 years ago

Fix will be pretty simple, just have to change cache directory from /inputdata to something any user should be able to access like /tmp/inputdata.

Edit: better yet /var/cache/inputdata.

Edit2: or even symlink /inputdata to /var/cache/inputdata.
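
For illustration only, here is a minimal sketch of the "use a directory any user can write to" idea. It is not the fv3config implementation: the helper name `first_writable_dir` and the candidate list are assumptions made for this example. It tries each candidate in order and returns the first one that can be created and written to.

```python
import os
import tempfile


def first_writable_dir(candidates):
    """Return the first candidate directory that can be created and written to.

    Hypothetical helper for illustration; not part of fv3config.
    """
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            # e.g. PermissionError when a parent directory is not writable
            # by the current user, as in the traceback above
            continue
        if os.access(path, os.W_OK | os.X_OK):
            return path
    raise PermissionError(f"no writable cache directory among {candidates!r}")


if __name__ == "__main__":
    # Assumed candidate order: the preferred location first, then a
    # fallback under the system temporary directory.
    candidates = [
        "/var/cache/inputdata",
        os.path.join(tempfile.gettempdir(), "inputdata"),
    ]
    print(first_writable_dir(candidates))
```

Falling back to the temporary directory trades persistence of the cache for the guarantee that an arbitrary `--user` inside the container can still run.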

mcgibbon commented 4 years ago

Note this bug won't actually occur if you build an image from the latest fv3gfs-python master branch, because the new fv3config changes have not yet been checked out in fv3gfs-python. But it has to be fixed before those changes are checked out there.

oliverwm1 commented 4 years ago

Sorry this caught you @nbren12. I pushed a fv3gfs-python image that was built from my fix/oversubscribe fv3config branch, which included the new caching changes. What's the best approach in terms of tagging/versioning images?

nbren12 commented 4 years ago

I just looked up the exact "digest" like this:

gcloud container images list-tags --format='json' us.gcr.io/vcm-ml/fv3gfs-python

And then used an image name like this:

docker run us.gcr.io/vcm-ml/fv3gfs-python@<digest>

That might be a good pattern going forward.
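
As a rough sketch of how that lookup could be scripted: the helpers below are hypothetical, and they assume the JSON output is a list of objects with `digest` and `tags` keys, which should be checked against the real output of the command above. The digest in the sample is a placeholder, not a real image.

```python
import json
import re

# A sha256 digest is "sha256:" followed by 64 lowercase hex characters.
DIGEST_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def digest_for_tag(list_tags_json, tag):
    """Find the digest of the entry carrying `tag` in list-tags JSON output.

    Assumes each entry has "digest" and "tags" keys (an assumption about
    the output shape, not verified here).
    """
    for entry in json.loads(list_tags_json):
        if tag in entry.get("tags", []):
            return entry["digest"]
    raise KeyError(f"no image tagged {tag!r}")


def pinned_image(repository, digest):
    """Build an immutable reference of the form repository@digest."""
    if not DIGEST_PATTERN.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repository}@{digest}"


if __name__ == "__main__":
    # Placeholder data standing in for the output of `gcloud ... list-tags`.
    sample = json.dumps([{"digest": "sha256:" + "0" * 64, "tags": ["latest"]}])
    digest = digest_for_tag(sample, "latest")
    print(pinned_image("us.gcr.io/vcm-ml/fv3gfs-python", digest))
```

Unlike a tag such as `latest`, a digest reference keeps pointing at the same image even after someone pushes a new build.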

nbren12 commented 4 years ago

@mcgibbon Didn't the data used to go in appdirs? That might be a more robust cross-platform solution than using folders like the ones you mention.

mcgibbon commented 4 years ago

It still uses appdirs. The docker image just uses an environment variable to override the directory. The appdirs default is the user cache directory, so it would cause fv3run to fail (that was actually when we made the environment variable override).

mcgibbon commented 4 years ago

I think this is part of a larger issue that we should start versioning our products in general. At least, fv3config and fv3gfs-python should probably have proper major.minor.bugfix versions. Then fv3gfs-python images pushed by circleci would be named e.g. us.gcr.io/vcm-ml/fv3gfs-python:0.1.2. When we make images manually we would push with some other tag (e.g. us.gcr.io/vcm-ml/fv3gfs-python:0.1.2-oliwm).
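
A small sketch of that naming scheme, under the assumption that versions are plain `major.minor.bugfix` strings and that manual builds append a short suffix. The `image_name` helper is hypothetical and only illustrates the convention; it is not existing tooling.

```python
import re

# major.minor.bugfix, e.g. "0.1.2"
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")
# Characters that are safe to use inside a Docker tag.
SUFFIX_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def image_name(repository, version, suffix=None):
    """Build repository:version, or repository:version-suffix for manual builds."""
    if not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"expected major.minor.bugfix, got {version!r}")
    if suffix is None:
        return f"{repository}:{version}"
    if not SUFFIX_PATTERN.fullmatch(suffix):
        raise ValueError(f"invalid tag suffix: {suffix!r}")
    return f"{repository}:{version}-{suffix}"


if __name__ == "__main__":
    repo = "us.gcr.io/vcm-ml/fv3gfs-python"
    print(image_name(repo, "0.1.2"))           # image pushed by CI
    print(image_name(repo, "0.1.2", "oliwm"))  # image pushed manually
```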

If you want to pin a workflow to a particular image, using the digest is a good way to go.