awslabs / mlmax

Example templates for the delivery of custom ML solutions to production so you can get started quickly without having to make too many design choices.
https://mlmax.readthedocs.io/en/latest/
Apache License 2.0

Disable public yum repos on new DLAMI EC2 instance #53

Closed verdimrc closed 3 years ago

verdimrc commented 3 years ago

🐛 Bug report

Describe the bug

On the EC2 instance deployed by the environment module, `sudo yum update` times out on the public yum repos. The following commands were required to disable those repos:

  sudo yum-config-manager --disable libnvidia-container
  sudo yum-config-manager --disable neuron
  sudo yum-config-manager --disable nvidia-container-runtime
  sudo yum-config-manager --disable nvidia-docker
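The four disables above can be generated in one pass. A minimal sketch, assuming the same repo ids as reported (it only prints the commands, so the list can be reviewed before piping to `sudo sh` on the instance):

```shell
#!/usr/bin/env bash
# Sketch: emit one yum-config-manager disable command per public repo.
# Repo ids are taken from this issue; verify against `yum repolist all` first.
emit_disable_cmds() {
  local repo
  for repo in libnvidia-container neuron nvidia-container-runtime nvidia-docker; do
    printf 'yum-config-manager --disable %s\n' "$repo"
  done
}

emit_disable_cmds
```

On the instance this would be applied as `emit_disable_cmds | sudo sh`.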

To reproduce

Run `sudo yum update` and watch the timeout message, e.g.:

[ec2-user@ip-xx-xxx-xx-xxx pkgs]$ sudo yum update
Loaded plugins: dkms-build-requires, extras_suggestions, langpacks, priorities, update-motd, versionlock
amzn2-core                                                                                                | 3.7 kB  00:00:00     
amzn2extra-docker                                                                                         | 3.0 kB  00:00:00     
https://nvidia.github.io/nvidia-container-runtime/stable/amzn2/x86_64/repodata/repomd.xml: [Errno 14] curl#7 - "Failed to connect to nvidia.github.io port 443: Connection timed out"
Trying other mirror.
^C
 Current download cancelled, interrupt (ctrl-c) again within two seconds
to exit.

 One of the configured repositories failed (nvidia-container-runtime),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.

     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).

     3. Run the command with the repository temporarily disabled
            yum --disablerepo=nvidia-container-runtime ...

     4. Disable the repository permanently, so yum won't use it by default. Yum
        will then just ignore the repository until you permanently enable it
        again or use --enablerepo for temporary usage:

            yum-config-manager --disable nvidia-container-runtime
        or
            subscription-manager repos --disable=nvidia-container-runtime

     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus. yum will be be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=nvidia-container-runtime.skip_if_unavailable=true

failure: repodata/repomd.xml from nvidia-container-runtime: [Errno 256] No more mirrors to try.
https://nvidia.github.io/nvidia-container-runtime/stable/amzn2/x86_64/repodata/repomd.xml: [Errno 14] curl#7 - "Failed to connect to nvidia.github.io port 443: Connection timed out"
https://nvidia.github.io/nvidia-container-runtime/stable/amzn2/x86_64/repodata/repomd.xml: [Errno 15] user interrupt

Expected behavior

`sudo yum update` should skip the public yum repos instead of timing out on them.

System information

josiahdavis commented 3 years ago

Sorry, accidentally hit close :(

yihyap commented 3 years ago

Recommend adding a new section to the README with the following commands for security patching:

sudo yum-config-manager --disable libnvidia-container
sudo yum-config-manager --disable neuron
sudo yum-config-manager --disable nvidia-container-runtime
sudo yum-config-manager --disable nvidia-docker
sudo yum update-minimal --sec-severity=critical,important --bugfix
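If this needs to happen on every new instance rather than by hand, one option is to run the same commands at first boot via the instance's user data. A rough cloud-init sketch under that assumption (repo ids as reported in this issue; `runcmd` runs as root, so no `sudo`; untested against this module's environment definition):

```yaml
#cloud-config
# Sketch: disable the public repos, then apply security and bugfix patches at first boot.
runcmd:
  - yum-config-manager --disable libnvidia-container --disable neuron
  - yum-config-manager --disable nvidia-container-runtime --disable nvidia-docker
  - yum update-minimal --sec-severity=critical,important --bugfix
```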
github-actions[bot] commented 3 years ago

This issue is stale. If left untouched, it will be automatically closed in 7 days.

josiahdavis commented 3 years ago

@verdimrc has this issue been addressed satisfactorily?

verdimrc commented 3 years ago

+1. We're good.