ansible / galaxy

Legacy Galaxy still available as read-only on https://old-galaxy.ansible.com - looking for the new galaxy -> https://github.com/ansible/galaxy_ng
Apache License 2.0

ansible-galaxy collection install timeout #2302

Open gundalow opened 4 years ago

gundalow commented 4 years ago

Bug Report

SUMMARY

We've seen ERROR! Unexpected Exception, this is probably a bug: ('The read operation timed out',) (10 minute time out) quite a few times. Size of the collection doesn't seem to be related.

Is there any logging on Galaxy to see how common this is?

ansible-galaxy -vvv collection install fortinet.fortios
01:49 Downloading https://galaxy.ansible.com/download/fortinet-fortios-1.0.7.tar.gz to /root/.ansible/tmp/ansible-local-666KgfAMW/tmpXSNpnv
# Note 10 minutes have passed
01:59 ERROR! Unexpected Exception, this is probably a bug: ('The read operation timed out',)
STEPS TO REPRODUCE
EXPECTED RESULTS
ACTUAL RESULTS

https://app.shippable.com/github/ansible-collections/community.general/runs/164/3/console

01:40 + ansible-galaxy -vvv collection install fortinet.fortios
01:43 [WARNING]: You are running the development version of Ansible. You should only
01:43 run Ansible from "devel" if you are modifying the Ansible engine, or trying out
01:43 features under development. This is a rapidly changing source of code and can
01:43 become unstable at any point.
01:43 [DEPRECATION WARNING]: Setting verbosity before the arg sub command is 
01:43 deprecated, set the verbosity after the sub command. This feature will be 
01:43 removed in version 2.13. Deprecation warnings can be disabled by setting 
01:43 deprecation_warnings=False in ansible.cfg.
01:43 ansible-galaxy 2.10.0.dev0
01:43   config file = None
01:43   configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
01:43   ansible python module location = /root/venv/lib/python2.7/site-packages/ansible
01:43   executable location = /root/venv/bin/ansible-galaxy
01:43   python version = 2.7.15+ (default, Feb  9 2019, 11:33:22) [GCC 5.4.0 20160609]
01:43 No config file found; using defaults
01:43 Found installed collection ansible.posix:0.1.1 at '/root/.ansible/ansible_collections/ansible/posix'
01:43 Found installed collection ansible.netcommon:0.0.2 at '/root/.ansible/ansible_collections/ansible/netcommon'
01:43 Found installed collection community.crypto:0.1.0 at '/root/.ansible/ansible_collections/community/crypto'
01:43 Found installed collection community.kubernetes:0.10.0 at '/root/.ansible/ansible_collections/community/kubernetes'
01:43 [WARNING]: Collection at '/root/.ansible/ansible_collections/community/general'
01:43 does not have a MANIFEST.json file, cannot detect version.
01:43 Found installed collection community.general:* at '/root/.ansible/ansible_collections/community/general'
01:43 Found installed collection f5networks.f5_modules:1.2.1 at '/root/.ansible/ansible_collections/f5networks/f5_modules'
01:43 Found installed collection cisco.intersight:1.0.3 at '/root/.ansible/ansible_collections/cisco/intersight'
01:43 Found installed collection cisco.mso:0.0.4 at '/root/.ansible/ansible_collections/cisco/mso'
01:43 Found installed collection check_point.mgmt:1.0.4 at '/root/.ansible/ansible_collections/check_point/mgmt'
01:43 Found installed collection ovirt.ovirt_collection:1.0.1 at '/root/.ansible/ansible_collections/ovirt/ovirt_collection'
01:43 Process install dependency map
01:43 Processing requirement collection 'fortinet.fortios'
01:43 Opened /root/.ansible/galaxy_token
01:45 Collection 'fortinet.fortios' obtained from server default https://galaxy.ansible.com/api/
01:49 Starting collection install process
01:49 Installing 'fortinet.fortios:1.0.7' to '/root/.ansible/ansible_collections/fortinet/fortios'
01:49 Downloading https://galaxy.ansible.com/download/fortinet-fortios-1.0.7.tar.gz to /root/.ansible/tmp/ansible-local-666KgfAMW/tmpXSNpnv
01:59 ERROR! Unexpected Exception, this is probably a bug: ('The read operation timed out',)
01:59 the full traceback was:
01:59 
01:59 Traceback (most recent call last):
01:59   File "/root/venv/bin/ansible-galaxy", line 123, in <module>
01:59     exit_code = cli.run()
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/cli/galaxy.py", line 479, in run
01:59     context.CLIARGS['func']()
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/cli/galaxy.py", line 990, in execute_install
01:59     no_deps, force, force_deps, context.CLIARGS['allow_pre_release'])
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/galaxy/collection.py", line 601, in install_collections
01:59     collection.install(output_path, b_temp_path)
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/galaxy/collection.py", line 203, in install
01:59     self.b_path = self.download(b_temp_path)
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/galaxy/collection.py", line 188, in download
01:59     headers=headers)
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/galaxy/collection.py", line 1105, in _download_file
01:59     unredirected_headers=['Authorization'], http_agent=user_agent())
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/module_utils/urls.py", line 1383, in open_url
01:59     unredirected_headers=unredirected_headers)
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/module_utils/urls.py", line 1288, in open
01:59     return urllib_request.urlopen(request, None, timeout)
01:59   File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
01:59     return opener.open(url, data, timeout)
01:59   File "/usr/lib/python2.7/urllib2.py", line 429, in open
01:59     response = self._open(req, data)
01:59   File "/usr/lib/python2.7/urllib2.py", line 447, in _open
01:59     '_open', req)
01:59   File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
01:59     result = func(*args)
01:59   File "/root/venv/lib/python2.7/site-packages/ansible/module_utils/urls.py", line 448, in https_open
01:59     req
01:59   File "/usr/lib/python2.7/urllib2.py", line 1201, in do_open
01:59     r = h.getresponse(buffering=True)
01:59   File "/usr/lib/python2.7/httplib.py", line 1121, in getresponse
01:59     response.begin()
01:59   File "/usr/lib/python2.7/httplib.py", line 438, in begin
01:59     version, status, reason = self._read_status()
01:59   File "/usr/lib/python2.7/httplib.py", line 394, in _read_status
01:59     line = self.fp.readline(_MAXLINE + 1)
01:59   File "/usr/lib/python2.7/socket.py", line 480, in readline
01:59     data = self._sock.recv(self._rbufsize)
01:59   File "/usr/lib/python2.7/ssl.py", line 772, in recv
01:59     return self.read(buflen)
01:59   File "/usr/lib/python2.7/ssl.py", line 659, in read
01:59     v = self._sslobj.read(len)
01:59 SSLError: ('The read operation timed out',)
felixfontein commented 4 years ago

I've seen this a couple more times in Shippable.

gundalow commented 4 years ago

again just now

00:40 Installing 'community.kubernetes:0.10.0' to '/root/.ansible/ansible_collections/community/kubernetes'
00:51 ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
00:51 to see the full traceback, use -vvv
gundalow commented 4 years ago

Are there any server-side logs that exist (or could be added) so we can see how often this is occurring? We've hacked around this in Collections CI by attempting the install 3 times, though this just masks the problem.

gundalow commented 4 years ago

If we see this again we should mention it in IRC #ansible-galaxy and @cutwater will look at the logs.

cognifloyd commented 4 years ago

I'm seeing this quite a bit in github actions: https://github.com/cognifloyd/community.mongodb/runs/1130183174?check_suite_focus=true

I've added some retry logic, but that only partially works. It looks like ansible-galaxy has a hard-coded 20 second timeout.

https://github.com/ansible/ansible/blob/fa1fb2d13bdf948dc319be57e8465a9ef48c7fe3/lib/ansible/galaxy/api.py#L195-L197

I'll go mention it in #ansible-galaxy

felixfontein commented 4 years ago

Right now it's happening a lot more than usual.

jainnikhil30 commented 4 years ago

I have run into this randomly too. It happens when getting collections from either Galaxy or AH.

jrosser commented 4 years ago

I see this a lot in openstack-ansible CI and it's likely the most frequent cause of change-unrelated job failures currently.

hojerst commented 3 years ago

We are currently running into the same issue. galaxy.ansible.com sometimes takes about 16 seconds to answer, which causes the timeout in the ansible-galaxy command:

-bash$ curl -fL https://galaxy.ansible.com/download/ansible-netcommon-1.3.1-dev6.tar.gz -o ansible-netcommon-1.3.1-dev6.tar.gz; rm ansible-netcommon-1.3.1-dev6.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  211k  100  211k    0     0   157k      0  0:00:01  0:00:01 --:--:--  157k

-bash$ curl -fL https://galaxy.ansible.com/download/ansible-netcommon-1.3.1-dev6.tar.gz -o ansible-netcommon-1.3.1-dev6.tar.gz; rm ansible-netcommon-1.3.1-dev6.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:15 --:--:--     0
100  211k  100  211k    0     0  13130      0  0:00:16  0:00:16 --:--:--  946k

Is it possible to increase the timeout on client side for the time being?

gundalow commented 3 years ago

Update from the Ansible side: it appears that someone is scraping galaxy.ansible.com on the hour (every hour), which increases load and causes other requests to time out. We are adding logging of HTTP headers to the API service to help identify the source.

lucastheisen commented 3 years ago

This happens pretty consistently for my use case. Specifically, we have a fork of community.general that we use to patch a bug in gitlab_runner. So we configure our requirements:

---
collections:
- name: git+https://github.com/marwatk/community.general.git
  type: git
  version: gitlab-runner-fix

When we run install, it times out on Installing 'google.cloud:1.0.1':

ltheisen@PC:~/git/etl-ansible$ ansible-galaxy collection install --requirements requirements.yml
Starting galaxy collection install process
Process install dependency map
Starting collection install process
Installing 'community.general:2.0.0' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/community/general'
Created collection for community.general at /mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/community/general
community.general (2.0.0) was installed successfully
Installing 'ansible.netcommon:1.4.1' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/ansible/netcommon'
Downloading https://galaxy.ansible.com/download/ansible-netcommon-1.4.1.tar.gz to /home/ltheisen/.ansible/tmp/ansible-local-6034j4q6km4e/tmpljq0sbh0
ansible.netcommon (1.4.1) was installed successfully
Installing 'community.kubernetes:1.1.1' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/community/kubernetes'
Downloading https://galaxy.ansible.com/download/community-kubernetes-1.1.1.tar.gz to /home/ltheisen/.ansible/tmp/ansible-local-6034j4q6km4e/tmpljq0sbh0
community.kubernetes (1.1.1) was installed successfully
Installing 'google.cloud:1.0.1' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/google/cloud'
Downloading https://galaxy.ansible.com/download/google-cloud-1.0.1.tar.gz to /home/ltheisen/.ansible/tmp/ansible-local-6034j4q6km4e/tmpljq0sbh0
ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
to see the full traceback, use -vvv

This is further compounded by the fact that a subsequent install attempt reports the collection as already installed:

ltheisen@MM233009-PC:~/git/etl-ansible$ ansible-galaxy collection install --requirements requirements.yml
Starting galaxy collection install process
Process install dependency map
Starting collection install process
Skipping 'community.general' as it is already installed

Luckily the transitive deps exist from the original ansible install and we only need the modifications in the collection itself (not its transitive deps), but this makes it hard to automate around because the install command fails so in a script we have to trap/ignore that failure...

felixfontein commented 3 years ago

Right now (and earlier today), timeouts seem to happen a lot more.

rhysmeister commented 3 years ago

Right now (and earlier today), timeouts seem to happen a lot more.

I've been getting this constantly since yesterday.

ebuildy commented 3 years ago

Is there a status page or API?

And is there any workaround? Maybe a sed to change the hardcoded value?

ntimo commented 3 years ago

This is super frustrating since it prevents me from deploying DNS changes using my CI/CD pipelines, because the collections can't be installed. A status page would actually be quite nice, or a way to install the collections from a private mirror.

prasadbiarca commented 3 years ago

We are seeing this issue too, and the timeouts have been more frequent since this morning.

WaaZaa666 commented 3 years ago

Same here, AWX unable to fetch collection requirements from galaxy with timeouts.

felixfontein commented 3 years ago

@ntimo if you downloaded the collection tarballs, you can just install them with ansible-galaxy collection install. Maybe even installing from a URL works; I never tested that.

ntimo commented 3 years ago

@felixfontein I tried the following requirements.yml for collections:

collections:
- name: git+https://github.com/ansible-collections/community.general
  type: git
  version: 1.3.1
- name: git+https://github.com/ansible-collections/hetzner.hcloud
  type: git
  version: 1.2.1
- name: git+https://github.com/ansible-collections/community.zabbix
  type: git
  version: 1.1.0

but that failed with the following error:

Starting galaxy collection install process
Process install dependency map
ERROR! Unknown error when attempting to call Galaxy at 'https://galaxy.ansible.com/api/v2/collections/ansible/netcommon/versions/?page=9': The read operation timed out

So I will probably also need to install netcommon from GitHub. But this creates a "huge" rabbit hole of dependencies that you need to include :/ Which is not really user friendly.

areguera commented 3 years ago

I am also experiencing this issue when I run:

ansible-galaxy collection download awx.awx

it ends with:

ERROR! Unexpected Exception, this is probably a bug: The read operation timed out

The issue has been happening for a while (some months now) at irregular intervals. Today, however, I've run the ansible-galaxy command several times and every run returned this error. I started trying the ansible-galaxy command when AWX began failing to fetch the collections, and I am considering downloading the collections and working with a local copy of them.

felixfontein commented 3 years ago

Luckily all these dependencies will be gone for community.general 2.0.0 :)

gundalow commented 3 years ago

Services behind galaxy.ansible.com were restarted about an hour ago. Also some worker restart thresholds have been increased.

areguera commented 3 years ago

Thank you so much @gundalow . I have run the same ansible-galaxy download command several times today without time out issues.

serl commented 3 years ago

I've been trying to install community.general since yesterday evening. I managed to install part of the dependencies straight away, but then:

$ ansible-galaxy collection install -r requirements.yml
Process install dependency map
Starting collection install process
Skipping 'ansible.netcommon' as it is already installed
Skipping 'google.cloud' as it is already installed
Installing 'community.general:1.3.3' to '/home/me/.ansible/collections/ansible_collections/community/general'
ERROR! Unexpected Exception, this is probably a bug: ('The read operation timed out',)

By adding -vvv and wgeting the actual package url, it looks like there is a redirect to S3, which answers after some delay.

What worked for me was to change the default 10 seconds delay to 30 seconds in open_url here: https://github.com/ansible/ansible/blob/7f0eb7ad799e531a8fbe5cc4f46046a4b1aeb093/lib/ansible/module_utils/urls.py#L1524.

Isn't 10 seconds a little too optimistic?

sugitk commented 3 years ago

A customer also reported this issue. I proposed the modification above to increase the timeout, and it resolved their problem.

Should we raise an RFE against ansible/ansible? If customers could configure a timeout value in ansible.cfg or something like that, it would be helpful.

knutze commented 3 years ago

I think it would be a good idea to add a timeout option to the install command so that we can specify the parameters to pass here (or elsewhere). https://github.com/ansible/ansible/blob/bf7d4ce260dc4ffc6074b2a392b9ff4d3794308b/lib/ansible/galaxy/collection/concrete_artifact_manager.py#L404

VasseurLaurent commented 3 years ago

Hello, I have the same issue: every day, at least one of my AWX jobs fails because of it. Is there any workaround?

mamercad commented 3 years ago

Hitting this more frequently in the last week or so as well.

knutze commented 3 years ago

@VasseurLaurent In my case, I found the code where the installed galaxy uses open_url and added the timeout=60 parameter directly. I guess you could try grep open_url first. Another way is to download the galaxy collection files using another method (curl or wget) that allows you to specify a timeout value, and then use that to install from a local file. I think the URL was printed out in the error message.
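
The second approach above (fetch the tarball with an explicit timeout, then install from the local file) could be sketched roughly as follows. The helper names `download_tarball` and `install_command` are hypothetical, and the 60-second default only mirrors the value mentioned in the comment; this is an illustration using the Python standard library, not how ansible-galaxy itself downloads collections.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import List


def download_tarball(url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Fetch `url` into `dest`, giving up if the server stalls longer than `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def install_command(tarball: Path) -> List[str]:
    """Build the command line that installs a collection from a local tarball."""
    return ["ansible-galaxy", "collection", "install", str(tarball)]
```

The list returned by `install_command` can be passed to `subprocess.run(..., check=True)` once the download has succeeded, so the install step no longer depends on the download timeout inside ansible-galaxy.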

mafalb commented 3 years ago

btw, a generic remark about request timeouts and DNS: my understanding is that the timeout of the above-mentioned open_url() includes the DNS lookup. With the default timeout and retry configuration of the classic DNS resolver on Linux, it can take up to 30s until a name is resolved (5s timeout × 3 DNS servers × 2 attempts, see resolv.conf(5)), and the request is not sent to the server before the name is resolved. With that in mind, a 10s (or even 30s) timeout for the whole request seems too low to survive unhealthy DNS servers.
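
The arithmetic behind that estimate can be written out as a small worked example. The function name is hypothetical, and the defaults are the figures quoted in the comment (5-second timeout, 2 attempts, 3 nameservers); real resolver behaviour may differ with options such as `rotate`, so treat this as a rough upper bound rather than a model of the resolver.

```python
def worst_case_dns_seconds(nameservers: int = 3, timeout: float = 5.0, attempts: int = 2) -> float:
    """Rough upper bound: each nameserver is tried `attempts` times, waiting `timeout` seconds per try."""
    return nameservers * timeout * attempts


request_timeout = 10.0  # the request timeout discussed in this thread
worst = worst_case_dns_seconds()
print(f"worst-case DNS wait: {worst:.0f}s, request timeout: {request_timeout:.0f}s")
# prints "worst-case DNS wait: 30s, request timeout: 10s"
```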

oivindoh commented 3 years ago

After the downtime on ansible galaxy earlier today this has been an issue to the point where I still haven't been able to install all my roles while building a docker container

VasseurLaurent commented 3 years ago

Hello @knutze, unfortunately it is embedded in the AWX Docker images, so I don't think changing part of the source code is a good idea. But thanks a lot for the idea, it can help with manual downloads.

andrew-sumner commented 2 years ago

Is there a planned fix for this? I'm constantly hitting this issue.

Gaabaa commented 2 years ago

Many of my scheduled AWX jobs are ruined because of this. I'm also looking for a fix. Is there any possible workaround?

jrosser commented 2 years ago

This also ruins opendev.org CI jobs for openstack, wasting hundreds of hours of donated CPU time. Reality dictates that I now use upstream git repos instead wherever the metadata allows (https://github.com/openstack/openstack-ansible/blob/master/ansible-collection-requirements.yml).

Whatever GitHub is doing is much, much more reliable than the Galaxy servers. You can even rewrite the collections file in CI to point to local git clones if you want even fewer external dependencies.
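
A rough sketch of what such a rewrite step might look like, operating on already-parsed requirements entries. Everything here is an assumption for illustration: the function name, the mirror layout (one clone per repository, named after the last URL path segment), and the use of a `git+file://` source, which the Ansible documentation describes for installing collections from a local repository.

```python
from pathlib import Path
from typing import Dict, List


def point_at_local_clones(entries: List[Dict[str, str]], mirror_root: Path) -> List[Dict[str, str]]:
    """Return a copy of `entries` in which git sources with a local clone point at that clone."""
    rewritten = []
    for entry in entries:
        entry = dict(entry)  # copy, so the caller's data is not mutated
        if entry.get("type") == "git":
            # Assumed layout: <mirror_root>/<last URL path segment without ".git">
            repo = entry["name"].rstrip("/").rsplit("/", 1)[-1]
            if repo.endswith(".git"):
                repo = repo[: -len(".git")]
            clone = mirror_root / repo
            if clone.is_dir():
                entry["name"] = "git+" + clone.resolve().as_uri()
        rewritten.append(entry)
    return rewritten
```

Entries without a matching clone are left untouched, so a partially populated mirror still yields a usable requirements file.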

kdelee commented 2 years ago

this is happening to me with https://galaxy.ansible.com/ibm/cloudcollection a lot today

Starting collection install process
Downloading https://galaxy.ansible.com/download/google-cloud-1.0.2.tar.gz to /home/jenkins/agent/.ansible/tmp/ansible-local-18w1pgjmff/tmpz7dlcsfg/google-cloud-1.0.2-c5moizrr
Installing 'google.cloud:1.0.2' to '/home/jenkins/agent/workspace/sandbox/elijah-aap-shared-library/ansible_collections/google/cloud'
google.cloud:1.0.2 was installed successfully
Downloading https://galaxy.ansible.com/download/azure-azcollection-1.11.0.tar.gz to /home/jenkins/agent/.ansible/tmp/ansible-local-18w1pgjmff/tmpz7dlcsfg/azure-azcollection-1.11.0-8on7ttab
ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
to see the full traceback, use -vvv
kdelee commented 2 years ago

put up https://github.com/ansible/ansible/pull/77088 -- reviews welcome

mnaser commented 2 years ago

I'm running into this right now. I am getting a CloudFlare branded 504 which means the origin server (Galaxy) gave a gateway timeout.

cjaiello commented 2 years ago

Running into this probably 4 out of 5 times.

parislarkins commented 2 years ago

Also running into this right now

metalcated commented 2 years ago

Same here. Really holding up some testing when I have a go-live in 2 days! Eek!

buluma commented 2 years ago

Having the same problem here

ben-z commented 2 years ago

Just made a duct-tape solution inspired by @kdelee's changes. To use it, install ansible like this:

python3 -m pip install https://github.com/WATonomous/ansible/archive/galaxy_timeout.tar.gz

Turns out if you retry enough times the install eventually succeeds 🤪.

Took a lot of tries to pinpoint where the retry is needed. The code is messy (I added a lot of debugging statements). Take a look at your own risk 😜.

AMKamel commented 2 years ago

This was the workaround I used to bypass this

    - name: Install ansible galaxy collections
      ansible.builtin.command:
        cmd: ansible-galaxy collection install "{{ item }}"
        # Skip a collection whose directory already exists (namespace.name -> namespace/name).
        creates: "$HOME/.ansible/collections/ansible_collections/{{ item | replace('.', '/') }}"
      loop:
        - community.mysql
        - community.general
        - community.hashi_vault
        - community.docker
        - ansible.posix
        - community.mongodb
      register: install_ansible_collections
      retries: 10
      until: install_ansible_collections.rc == 0
SpComb commented 2 years ago

This is failing often enough to cause significant CI breakage. Here's a Dockerfile retry loop to try and mitigate this:

ADD requirements.yml /tmp/

RUN \
      for i in 5 4 3 2 1; do \
        if ansible-galaxy collection install -p /usr/share/ansible/collections -r /tmp/requirements.yml; then \
          break; \
        elif [ $i -gt 1 ]; then \
          sleep 10; \
        else \
          exit 1; \
        fi; \
      done \
  &&  ansible-galaxy collection list
udondan commented 2 years ago

I had no luck retrying in an infinite loop. It seems completely dead:

RUN until ansible-galaxy collection install \
        community.molecule \
        community.windows:==1.3.0 \
        community.aws:==1.5.0; \
    do \
        echo "Galaxy failed. Try again"; \
    done

Reminder that you can just install collections via git. Just make sure you check the galaxy.yml and also install the contained dependencies.

SpComb commented 2 years ago

An infinite retry loop with no back-off/delay or limit will presumably only make the situation worse.
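
For comparison, a bounded retry with exponential back-off and jitter might look like the sketch below. The function name and default values are hypothetical, chosen only to illustrate the idea of limiting attempts and spacing them out.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, waiting longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # limit reached: surface the last error
            # Exponential growth, capped at max_delay, plus jitter so parallel jobs spread out.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The action could be, for example, `lambda: subprocess.run(["ansible-galaxy", "collection", "install", "-r", "requirements.yml"], check=True)`, so that a non-zero exit status counts as a failure and triggers the next delayed attempt.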

samisevenx00 commented 2 years ago

We have the same issue, so we have disabled the requirements file in our AWX.

udondan commented 2 years ago

An infinite retry loop with no back-off/delay or limit will presumably only make the situation worse.

Well, yes and no. Your 1 sec just adds to the timeout. It's not like it's permanently hammering the API. In fact I used to have a sleep 1 there first but dropped it.

At this point any attempt is making the situation worse. That's why I completely dropped galaxy for now and install via git.

If anyone needs an example, this is my quick and dirty solution:

RUN mkdir -p /usr/share/ansible/ansible_collections/community \
             /usr/share/ansible/ansible_collections/ansible \
             /usr/share/ansible/ansible_collections/amazon && \
    cd /usr/share/ansible/ansible_collections/community && \
    git clone https://github.com/ansible-collections/community.molecule.git molecule && \
    git clone https://github.com/ansible-collections/community.windows.git windows && \
    cd windows && git checkout -q v1.3.0 && cd .. && \
    git clone https://github.com/ansible-collections/community.aws.git aws && \
    cd aws && git checkout -q 1.5.0 && cd .. && \
    cd /usr/share/ansible/ansible_collections/ansible && \
    git clone https://github.com/ansible-collections/ansible.windows.git windows && \
    cd windows && git checkout -q 1.9.0 && cd .. && \
    cd /usr/share/ansible/ansible_collections/amazon && \
    git clone https://github.com/ansible-collections/amazon.aws.git aws && \
    cd aws && git checkout -q 3.1.1
siddharthmessi17 commented 2 years ago

Any updates on this issue? Is this problem specific to any particular version of Ansible?