apptainer / singularity

Singularity has been renamed to Apptainer as part of us moving the project to the Linux Foundation. This repo has been persisted as a snapshot right before the changes.
https://github.com/apptainer/apptainer

Check validity of downloads #1091

Closed: vsoch closed this issue 6 years ago

vsoch commented 6 years ago

We should, as a sanity check, make sure that Docker layers (and other downloaded bits) are complete and intact before we dump them into the image. I've seen a few issues come up that are (possibly?) related to a bad download, meaning one that completes but is corrupt for some other reason.
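Here is a minimal sketch of the "downloaded entirely" part. It assumes the expected length comes from the layer's `size` field in a Docker v2 (schema 2) image manifest. `check_layer_size` is a hypothetical helper, not existing Singularity code. It also won't catch a corrupt file that happens to have the right length.

```python
import os


def check_layer_size(path: str, expected_size: int) -> bool:
    """Return True if the file at `path` is exactly `expected_size` bytes long.

    `expected_size` is assumed to be the layer's "size" from the image
    manifest. A mismatch means the download was truncated, or that
    something else (e.g. an error page) was saved in its place.
    """
    return os.path.getsize(path) == expected_size


# Hypothetical usage after a download:
# if not check_layer_size(layer_path, layer["size"]):
#     raise RuntimeError(f"incomplete download: {layer_path}")
```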

dtrudg commented 6 years ago

In Slack, @vsoch came across this issue again: a Cloudflare "banned" page had been returned with a 200 success status. A content check is needed to catch this, not just a status code check.
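One way to do a content check is to look at the first few bytes of the file. This assumes layers are gzip-compressed tarballs (media type `application/vnd.docker.image.rootfs.diff.tar.gzip`). Gzip data starts with the magic number `1f 8b`, while an error page like Cloudflare's starts with `<!DOCTYPE` or `<html`. The helper names below are hypothetical, and this is a heuristic sketch rather than full validation.

```python
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(head: bytes) -> bool:
    """True if the data starts with the gzip magic number (1f 8b)."""
    return head.startswith(GZIP_MAGIC)


def looks_like_html(head: bytes) -> bool:
    """Heuristic: HTML error pages usually begin with <!DOCTYPE or <html."""
    start = head.lstrip().lower()
    return start.startswith(b"<!doctype") or start.startswith(b"<html")


def check_layer_content(path: str) -> None:
    """Raise ValueError if the file at `path` is not plausibly a gzipped layer."""
    with open(path, "rb") as fh:
        head = fh.read(512)  # the first bytes are enough for both checks
    if looks_like_html(head):
        raise ValueError(f"{path}: got an HTML page instead of a layer")
    if not looks_like_gzip(head):
        raise ValueError(f"{path}: missing gzip magic number, not a .tar.gz layer")
```

A 200 response whose body trips `looks_like_html` could then be reported as "blocked or error page" instead of being extracted into the image.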

vsoch commented 6 years ago

This is a fun investigation :) Here is the full response:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Access denied | production.cloudflare.docker.com used Cloudflare to restrict access</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>
<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]-->
<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]-->
</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
        <h1>
          <span class="cf-error-type" data-translate="error">Error</span>
          <span class="cf-error-code">1010</span>
          <small class="heading-ray-id">Ray ID: 40ca50f5684e37ce &bull; 2018-04-16 23:08:51 UTC</small>
        </h1>
        <h2 class="cf-subheadline">Access denied</h2>
      </div><!-- /.header -->
      <section></section><!-- spacer -->
      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="what_happened">What happened?</h2>
            <p>The owner of this website (production.cloudflare.docker.com) has banned your access based on your browser's signature (40ca50f5684e37ce-ua48).</p>
          </div>

        </div>
      </div><!-- /.section -->
      <div class="cf-error-footer cf-wrapper">
  <p>
    <span class="cf-footer-item">Cloudflare Ray ID: <strong>40ca50f5684e37ce</strong></span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Your IP</span>: 96.10.226.142</span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Performance &amp; security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span>

  </p>
</div><!-- /.error-footer -->
    </div><!-- /#cf-error-details -->
  </div><!-- /#cf-wrapper -->
  <script type="text/javascript">
  window._cf_translation = {};

</script>
</body>
</html>

I think 40ca50f5684e37ce-ua48 is a specific ID (the part before the dash, which matches the Ray ID) tied to a user agent code, presumably from the Cloudflare rules that Docker production is using. It looks a lot like one of these rules --> https://api.cloudflare.com/#user-agent-blocking-rules-list-useragent-rules

The Ray ID looks like a per-request identifier that would help us get support, if needed --> https://support.cloudflare.com/hc/en-us/articles/200169746-Adding-the-CF-RAY-header-to-your-logs.

If I search that page for error code 1010, there are three options.

This one --> https://support.cloudflare.com/hc/en-us/articles/200171806-Error-1010-The-owner-of-this-website-has-banned-your-access-based-on-your-browser-s-signature actually matches most closely (the error code matches too), but it isn't super informative. The weird dance through proxies and removed headers has me thinking this is probably it.

I know that requests automatically adds a user agent (python-requests/<version>), but I couldn't figure out what old-school urllib and urllib2 would send, especially given all the ways they are passed around and used in Singularity. It could be any of the following:

  1. If the User-Agent is just missing, we could try adding one that is specific to the user at the time (so that if Singularity is installed in a container, the user agent wouldn't be shared).
  2. If the User-Agent is present but too common, and is hitting some rate limit as a result, we could still try changing it, perhaps to one of the most widely used ones so that it isn't flagged as a likely bot. My thinking behind this one is that I was making requests from a very basic (commonly used) container. If everyone is making requests from that container, or even one heavy user is, it would make sense to be blocked.
  3. If the request goes through a redirect, and the redirect always passes on the same user agent, that would be an issue for the service providing the redirect, because all users would then be sharing that user agent.

So, some sanity checks to try:

  1. Trace the user agent through the requests / responses.
  2. Determine whether the user agent is consistent between different hosts running the "same" container, or between a single host and a container running on it.
  3. Add a custom user agent, e.g. one of the most common ones --> https://techblog.willshouse.com/2012/01/03/most-common-user-agents/
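For checks 1 and 3, here is a minimal sketch using Python 3's `urllib.request`. It sets an explicit User-Agent and reads back what a request will send. Singularity 2.x itself uses Python 2 `urllib2`, where the details may differ. The `CUSTOM_UA` value is a hypothetical placeholder, not a recommendation.

```python
import urllib.request

# Hypothetical user agent string, for illustration only.
CUSTOM_UA = "singularity-pull-example/0.1"


def build_request(url: str, user_agent: str = CUSTOM_UA) -> urllib.request.Request:
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def describe_user_agent(req: urllib.request.Request) -> str:
    """Return the User-Agent this request carries, for tracing.

    urllib stores header names via str.capitalize(), hence "User-agent".
    If none is set on the request, the opener adds its own default
    ("Python-urllib/X.Y") only at send time.
    """
    return req.get_header("User-agent", "<opener default>")
```

Sending would then be `urllib.request.urlopen(build_request(url))`, and the same `describe_user_agent` call can be logged from inside and outside a container to compare them.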
vsoch commented 6 years ago

Also pinging @thiell here because he helped with the investigation above! (And might be interested too).

dtrudg commented 6 years ago

It would be valid to set a custom user-agent, but I'm not sure that would change things much here unless you are making requests from a common IP shared with loads of people - and those people also happen to be making requests to Docker hub using python urllib.

AFAIK the default user agent for urllib only really varies with the version of Python installed wherever you use it.
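A quick way to check this, assuming Python 3's `urllib.request`: the default opener's only User-Agent is built from the interpreter's major and minor version. Nothing host-, user- or container-specific goes into it.

```python
import sys
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders holds the default
# User-agent, "Python-urllib/<major>.<minor>" for the running interpreter.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua, "on Python", sys.version.split()[0])
```

Two containers with the same Python version should print the same string, which fits the idea that any block must also be keyed on something else, such as the IP.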

Redirection shouldn't be an issue: when a redirect response is issued, the original client makes a new request (sending its own user agent) directly to the referred address. A redirect isn't a 'passed on' request; it results in a brand new request by the originator. I think what you are getting at is proxying. If you happen to be sitting behind a web proxy, it's a lot easier to hit rate limits, as all users will be going out from a small set of IPs.
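Here is a small sketch of that point, using Python 3's `urllib.request.HTTPRedirectHandler` without any network access. Asking the handler how it would follow a 302 yields a new `Request` object for the new URL, carrying the client's own headers. The URLs and User-Agent value are made up for illustration.

```python
import urllib.request

# A request carrying the client's own User-Agent (hypothetical URL and value).
original = urllib.request.Request(
    "https://registry.example/v2/library/ubuntu/blobs/sha256:abc",
    headers={"User-Agent": "my-client/1.0"},
)

# Ask urllib's redirect handler what it would do with a 302 response.
# redirect_request(req, fp, code, msg, headers, newurl) builds the follow-up.
handler = urllib.request.HTTPRedirectHandler()
followed = handler.redirect_request(
    original, None, 302, "Found", {}, "https://cdn.example/blob-data"
)

# The follow-up is a brand-new Request to the new URL, made by the same
# client, with the client's headers (including User-Agent) copied over.
print(followed.full_url, followed.get_header("User-agent"))
```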

Where are you doing these Singularity pulls from, @v? How many do you think you are doing? Also, are you using any authentication to the Docker registry? There is a known bug, with a PR fix, that can easily hit rate limits if you make a lot of pulls and happen to specify invalid credentials: #1406

vsoch commented 6 years ago

The pulls are from the Tunel container --> https://singularityhub.github.io/interface and I have hit it twice, always with a vanilla ubuntu container. If I go away for a few hours and come back, it usually works again! I really wasn't doing more than maybe a few pulls an hour, for testing different views. I first thought it was something in Tunel, but when I went to the base command line in the container I still had the issue. I was hoping it was something related to running Singularity in Docker, e.g. some common mechanism by which anyone using the same ubuntu container would end up generating the same user agent? There definitely isn't any authentication here, beyond the standard pull stuff.

dtrudg commented 6 years ago

Hi @v, I meant: where is Tunel running? Are you on a University network behind a shared proxy? On a cloud service? etc. When you say you still had the issue at the base command line, do you mean that just doing stuff from the command line triggered it? Or did the problem remain after it had already shown up when using Tunel?

This is a tough one to reason out if you are only doing a few pulls per hour. I've yet to trigger a Cloudflare block doing anything. I once built about 2000 Singularity containers from Docker pulls over a weekend from a residential IP, and did the same thing in a shorter time span from a University IP.

The user agent looks like this: Python-urllib/2.7. It's the same for everyone using the same version of Python, so if you are blocked, the block is almost certainly associated with at least user agent + IP address.

vsoch commented 6 years ago

Oh! Of course! One time I was on shared public wireless, the second time on my own wireless, which is possibly provided by the same company (but a different router, of course). It was on my computer (Ubuntu 16.04), using Docker version 18.03.0-ce, build 0520e24, and the latest 2.x development branch. When I say command line, I mean the command line of the Docker container where Singularity is installed (not directly on the host). This was the first time I've seen it, which is why I thought it was something specific to being in a Docker container. But most stuff is working ok! :)

dtrudg commented 6 years ago

Okay - I have a theory on what might be leading to cloudflare bans here.

The python code to pull docker images is abusing the authentication token mechanism of the docker registry pretty badly.

  1. The initial request related to the pull happens and gets a 401, as a token is needed.
  2. update_token does its stuff, requesting a token with a 9000s expiry. This is as it should be, but 9000s is crazy long.
  3. Manifests are retrieved.
  4. Multiprocessing of layer downloads starts.
  5. After each layer download, update_token is called, even if we are only 0.5s into the 9000s token expiry. Note that because of multiprocessing, these calls can be concurrent.
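Here is a minimal sketch of the "reuse until expired" behaviour for steps 2 and 5. It assumes a hypothetical `fetch` callable that performs the real token request and returns `(token, expires_in_seconds)`. `TokenCache` and its 30 s safety margin are illustrative choices, not Singularity code. The lock only coordinates threads. With separate worker processes (as with multiprocessing), the token would instead need to be obtained once in the parent and handed to the workers.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse a registry bearer token until shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        margin: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # performs the real token request (hypothetical)
        self._margin = margin    # refresh this many seconds before expiry
        self._clock = clock      # injectable so expiry can be tested
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one only when needed."""
        # The lock stops concurrent threads from all refreshing at once.
        with self._lock:
            if self._token is None or self._clock() >= self._expires_at - self._margin:
                token, expires_in = self._fetch()
                self._token = token
                self._expires_at = self._clock() + expires_in
            return self._token
```

With a 9000s token, fetching five layers through one `TokenCache` makes a single token request instead of one per layer.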

If I pull e.g. dctrud/docker-aufs-sanity, which has 5 small layers, it takes less than 3 seconds to get all of them, but Singularity makes 9 requests for auth tokens when there should only be one.

This type of behaviour could definitely be grounds for a temporary API ban.

We need 2 fixes here:

  1. update_token shouldn't request a new token if the original hasn't expired yet.
  2. We really need to checksum downloaded layers, and raise an error at the point of download if the checksum doesn't match.
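Here is a sketch for fix 2. It assumes the expected value is the layer's `digest` from the image manifest, in `sha256:<hex>` form. That digest is computed over the compressed blob as downloaded, so the check can run before anything is extracted. `verify_layer_digest` is a hypothetical name, and only sha256 is handled here.

```python
import hashlib


def verify_layer_digest(path: str, digest: str, chunk_size: int = 1 << 20) -> None:
    """Raise ValueError unless the file's sha256 matches `digest`.

    `digest` is assumed to look like "sha256:<64 hex chars>", as listed
    for each layer in the image manifest.
    """
    algorithm, _, expected = digest.partition(":")
    if algorithm != "sha256" or not expected:
        raise ValueError(f"unsupported or malformed digest: {digest!r}")
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        # Stream in chunks so large layers aren't read into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    actual = h.hexdigest()
    if actual != expected.lower():
        raise ValueError(f"{path}: digest mismatch, expected {expected}, got {actual}")
```

An HTML error page saved in place of a layer fails this check immediately, at the point of download.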
vsoch commented 6 years ago

This was just reported again by @sbutcher:

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Access denied | production.cloudflare.docker.com used Cloudflare to restrict access</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>
<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script><!--<![endif]-->
<!--[if gte IE 10]><!--><script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script><!--<![endif]-->
</head>
<body>
  <div id="cf-wrapper">
    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
        <h1>
          <span class="cf-error-type" data-translate="error">Error</span>
          <span class="cf-error-code">1010</span>
          <small class="heading-ray-id">Ray ID: 40d8311b28e00a8a &bull; 2018-04-18 15:33:47 UTC</small>
        </h1>
        <h2 class="cf-subheadline">Access denied</h2>
      </div><!-- /.header -->
      <section></section><!-- spacer -->
      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="what_happened">What happened?</h2>
            <p>The owner of this website (production.cloudflare.docker.com) has banned your access based on your browser's signature (40d8311b28e00a8a-ua48).</p>
          </div>

        </div>
      </div><!-- /.section -->
      <div class="cf-error-footer cf-wrapper">
  <p>
    <span class="cf-footer-item">Cloudflare Ray ID: <strong>40d8311b28e00a8a</strong></span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Your IP</span>: xxx.xx.x.xxx</span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Performance &amp; security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span>

  </p>
</div><!-- /.error-footer -->
    </div><!-- /#cf-error-details -->
  </div><!-- /#cf-wrapper -->
  <script type="text/javascript">
  window._cf_translation = {};

</script>
</body>
</html>
vsoch commented 6 years ago

@dctrud I think that must be it! +1 on those solutions; I would only refresh once the token has expired.

dtrudg commented 6 years ago

Closing - this was included in 2.5.0.