gundalow opened 4 years ago
I've seen this a couple more times in Shippable.
again just now
00:40 Installing 'community.kubernetes:0.10.0' to '/root/.ansible/ansible_collections/community/kubernetes'
00:51 ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
00:51 to see the full traceback, use -vvv
Are there any server-side logs that exist (or could be added) so we can see how often this is occurring? We've hacked around this in Collections CI by attempting the install 3 times, though this just masks the problem.
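That CI hack amounts to a small retry wrapper around the install command; a minimal sketch (the `retry()` helper and the requirements file name are my own, not the actual Collections CI code):

```shell
# Hypothetical retry helper illustrating the "attempt the install 3 times"
# workaround; not the actual Collections CI code.
retry() {
  # usage: retry <attempts> <delay_seconds> <command...>
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    echo "attempt $n failed, retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Usage, matching the three-attempt hack described above:
# retry 3 10 ansible-galaxy collection install -r requirements.yml
```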
If we see this again we should mention it in IRC #ansible-galaxy, and @cutwater will look at the logs.
I'm seeing this quite a bit in github actions: https://github.com/cognifloyd/community.mongodb/runs/1130183174?check_suite_focus=true
I've added some retry logic, but that only partially works.
It looks like ansible-galaxy has a hard-coded 20-second timeout.
I'll go mention it in #ansible-galaxy
Right now it's happening a lot more than usual.
I have run into this randomly too. It happens both when getting collections from Galaxy and from AH.
I see this a lot in openstack-ansible CI and it's likely the most frequent cause of change-unrelated job failures currently.
We are currently running into the same issue. galaxy.ansible.com sometimes takes about 16 seconds to answer, which triggers the timeout in the ansible-galaxy command:
-bash$ curl -fL https://galaxy.ansible.com/download/ansible-netcommon-1.3.1-dev6.tar.gz -o ansible-netcommon-1.3.1-dev6.tar.gz; rm ansible-netcommon-1.3.1-dev6.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  211k  100  211k    0     0   157k      0  0:00:01  0:00:01 --:--:--  157k
-bash$ curl -fL https://galaxy.ansible.com/download/ansible-netcommon-1.3.1-dev6.tar.gz -o ansible-netcommon-1.3.1-dev6.tar.gz; rm ansible-netcommon-1.3.1-dev6.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:15 --:--:--     0
100  211k  100  211k    0     0  13130      0  0:00:16  0:00:16 --:--:--  946k
Is it possible to increase the timeout on client side for the time being?
Update from the Ansible side: it appears that someone is scraping galaxy.ansible.com on the hour (every hour), which is causing increased load and other requests to time out. We are adding some logging in the API service to capture that from the HTTP headers, to help identify the client.
This happens pretty consistently for my use case. Specifically, we have a fork of community.general that we use to patch a bug in gitlab_runner. So we configure our requirements:
---
collections:
  - name: git+https://github.com/marwatk/community.general.git
    type: git
    version: gitlab-runner-fix
When we run install, it times out on Installing 'google.cloud:1.0.1':
ltheisen@PC:~/git/etl-ansible$ ansible-galaxy collection install --requirements requirements.yml
Starting galaxy collection install process
Process install dependency map
Starting collection install process
Installing 'community.general:2.0.0' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/community/general'
Created collection for community.general at /mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/community/general
community.general (2.0.0) was installed successfully
Installing 'ansible.netcommon:1.4.1' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/ansible/netcommon'
Downloading https://galaxy.ansible.com/download/ansible-netcommon-1.4.1.tar.gz to /home/ltheisen/.ansible/tmp/ansible-local-6034j4q6km4e/tmpljq0sbh0
ansible.netcommon (1.4.1) was installed successfully
Installing 'community.kubernetes:1.1.1' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/community/kubernetes'
Downloading https://galaxy.ansible.com/download/community-kubernetes-1.1.1.tar.gz to /home/ltheisen/.ansible/tmp/ansible-local-6034j4q6km4e/tmpljq0sbh0
community.kubernetes (1.1.1) was installed successfully
Installing 'google.cloud:1.0.1' to '/mnt/c/Users/ltheisen/git/etl-ansible/ansible_collections/google/cloud'
Downloading https://galaxy.ansible.com/download/google-cloud-1.0.1.tar.gz to /home/ltheisen/.ansible/tmp/ansible-local-6034j4q6km4e/tmpljq0sbh0
ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
to see the full traceback, use -vvv
This is further compounded by the fact that a subsequent attempt to install indicates the collection is already installed:
ltheisen@MM233009-PC:~/git/etl-ansible$ ansible-galaxy collection install --requirements requirements.yml
Starting galaxy collection install process
Process install dependency map
Starting collection install process
Skipping 'community.general' as it is already installed
Luckily the transitive deps exist from the original ansible install and we only need the modifications in the collection itself (not its transitive deps), but this makes it hard to automate around: because the install command fails, in a script we have to trap/ignore that failure...
Right now (and earlier today), timeouts seem to happen a lot more.
Getting this since yesterday, and all the time.
Is there any status page or API?
And any workaround? Maybe a sed to change the hardcoded value.
This is super frustrating since it prevents me from deploying DNS changes using my CI/CD pipelines. Because the collections can't be installed. A status page would actually be quite nice. Or a way to install the collections from a private mirror.
We are seeing this issue too, and since this morning the timeouts happen more often.
Same here, AWX unable to fetch collection requirements from galaxy with timeouts.
@ntimo if you downloaded the collection tarballs, you can just install them with ansible-galaxy collection install. Maybe even installing from a URL works; I've never tested that.
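A sketch of that approach: build the download URL (the pattern matches the galaxy.ansible.com links in the logs above), fetch the tarball with curl's own retry and timeout controls, then install from the local file. The namespace/name/version below are example values only:

```shell
# Example values only; substitute your own collection.
ns=community name=kubernetes version=1.1.1
tarball="${ns}-${name}-${version}.tar.gz"
url="https://galaxy.ansible.com/download/${tarball}"
echo "$url"
# → https://galaxy.ansible.com/download/community-kubernetes-1.1.1.tar.gz

# curl lets you set retry/timeout limits that ansible-galaxy (at the time)
# did not expose:
# curl -fL --retry 5 --retry-delay 10 --max-time 300 -o "$tarball" "$url"
# ansible-galaxy collection install "./$tarball"
```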
@felixfontein I tried the following requirements.yml for collections:
collections:
  - name: git+https://github.com/ansible-collections/community.general
    type: git
    version: 1.3.1
  - name: git+https://github.com/ansible-collections/hetzner.hcloud
    type: git
    version: 1.2.1
  - name: git+https://github.com/ansible-collections/community.zabbix
    type: git
    version: 1.1.0
but that failed with the following error:
Starting galaxy collection install process
Process install dependency map
ERROR! Unknown error when attempting to call Galaxy at 'https://galaxy.ansible.com/api/v2/collections/ansible/netcommon/versions/?page=9': The read operation timed out
So I will probably also need to install netcommon from GitHub. But this created a "huge" rabbit hole of dependencies that you need to include :/ which is not really user-friendly.
I am also seeing this issue when I run:
ansible-galaxy collection download awx.awx
it ends with:
ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
The issue has been happening intermittently for a while (it started some months ago). Today, however, I've run the ansible-galaxy command several times and every run returned this error. I started trying the ansible-galaxy command directly when AWX started failing to fetch the collections, and I'm considering downloading the collections and working with a local copy of them.
Luckily all these dependencies will be gone for community.general 2.0.0 :)
Services behind galaxy.ansible.com were restarted about an hour ago. Also some worker restart thresholds have been increased.
Thank you so much @gundalow . I have run the same ansible-galaxy download command several times today without time out issues.
I've been trying to install community.general since yesterday evening. I managed to install part of the dependencies straight away, but then:
$ ansible-galaxy collection install -r requirements.yml
Process install dependency map
Starting collection install process
Skipping 'ansible.netcommon' as it is already installed
Skipping 'google.cloud' as it is already installed
Installing 'community.general:1.3.3' to '/home/me/.ansible/collections/ansible_collections/community/general'
ERROR! Unexpected Exception, this is probably a bug: ('The read operation timed out',)
By adding -vvv and wget-ing the actual package URL, it looks like there is a redirect to S3, which answers after some delay.
What worked for me was to change the default 10-second timeout to 30 seconds in open_url here: https://github.com/ansible/ansible/blob/7f0eb7ad799e531a8fbe5cc4f46046a4b1aeb093/lib/ansible/module_utils/urls.py#L1524.
Isn't 10 seconds a little too optimistic?
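The "sed to change the hardcoded value" idea floated earlier can be sketched like this; the exact line differs between Ansible versions, so grep first, and the demonstration below runs against a stand-in file rather than a live install:

```shell
# To find the real file on your machine (path varies per install):
#   urls_py=$(python3 -c 'import ansible.module_utils.urls as u; print(u.__file__)')
#   grep -n 'timeout=10' "$urls_py"
# Demonstrated on a stand-in file, since editing a live install is risky:
f=$(mktemp)
printf 'def open_url(url, timeout=10):\n    pass\n' > "$f"
sed -i 's/timeout=10/timeout=60/' "$f"
grep 'timeout=' "$f"
# → def open_url(url, timeout=60):
```

Note this patch is lost on every upgrade or reinstall, so it is only a stopgap until a configurable timeout lands.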
A customer also reported this issue; I proposed the modification above to increase the timeout, and it resolved their problem.
Should we raise an RFE against ansible/ansible? If customers could configure a timeout value in ansible.cfg or something like that, it would be helpful.
I think it would be a good idea to add a timeout option to the install command so that we can specify the parameters to pass here (or elsewhere). https://github.com/ansible/ansible/blob/bf7d4ce260dc4ffc6074b2a392b9ff4d3794308b/lib/ansible/galaxy/collection/concrete_artifact_manager.py#L404
Hello, I have the same issue; every day at least one of my AWX jobs fails because of this. Is there any workaround?
Hitting this more frequently in the last week or so as well.
@VasseurLaurent In my case, I found the code where the installed galaxy uses open_url and added the timeout=60 parameter directly. I guess you could try grep open_url first.
Another way is to download the galaxy collection files using another method (curl or wget) that allows you to specify a timeout value, and then use that to install from a local file. I think the URL was printed out in the error message.
btw, a generic remark about request timeouts and DNS: my understanding is that the above-mentioned open_url() time includes the DNS lookup. With the default timeout and retry configuration of the classic Linux resolver, it can take up to 30 s until a name is resolved (5 s timeouts with 3 DNS servers and 2 attempts; see resolv.conf(5)), and the request is not sent to the server before the name is resolved. With that in mind, a 10 s (and even a 30 s) timeout for the whole request seems too low to survive bad DNS server health conditions.
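To make the arithmetic explicit: the worst case is roughly attempts × nameservers × timeout = 2 × 3 × 5 s = 30 s before the HTTP request is even sent. Where you control the resolver configuration, that budget can be tightened; a sketch (the nameserver addresses are placeholders):

```
# /etc/resolv.conf (sketch; see resolv.conf(5))
nameserver 192.0.2.1
nameserver 192.0.2.2
options timeout:2 attempts:1
# worst-case DNS wait is now 1 attempt x 2 servers x 2 s = 4 s
```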
After the downtime on Ansible Galaxy earlier today, this has been enough of an issue that I still haven't been able to install all my roles while building a Docker container.
Hello @knutze, unfortunately it is embedded in the AWX Docker images, so I don't think changing part of the source code is a good idea. But thanks a lot for the idea; it can help with manual downloads.
Is there a planned fix for this? I'm constantly hitting this issue.
Many of my scheduled AWX jobs are ruined because of this. I'm also looking for a fix; is there any possible workaround?
This also ruins opendev.org CI jobs for OpenStack, wasting hundreds of hours of donated CPU time. Reality dictates that I now use upstream git repos instead wherever the metadata allows (https://github.com/openstack/openstack-ansible/blob/master/ansible-collection-requirements.yml).
Whatever GitHub is doing is much, much more reliable than the Galaxy servers. You can even rewrite the collections file in CI to point to local git clones if you want even fewer external dependencies.
This is happening to me a lot today with https://galaxy.ansible.com/ibm/cloudcollection
Starting collection install process
Downloading https://galaxy.ansible.com/download/google-cloud-1.0.2.tar.gz to /home/jenkins/agent/.ansible/tmp/ansible-local-18w1pgjmff/tmpz7dlcsfg/google-cloud-1.0.2-c5moizrr
Installing 'google.cloud:1.0.2' to '/home/jenkins/agent/workspace/sandbox/elijah-aap-shared-library/ansible_collections/google/cloud'
google.cloud:1.0.2 was installed successfully
Downloading https://galaxy.ansible.com/download/azure-azcollection-1.11.0.tar.gz to /home/jenkins/agent/.ansible/tmp/ansible-local-18w1pgjmff/tmpz7dlcsfg/azure-azcollection-1.11.0-8on7ttab
ERROR! Unexpected Exception, this is probably a bug: The read operation timed out
to see the full traceback, use -vvv
put up https://github.com/ansible/ansible/pull/77088 -- reviews welcome
I'm running into this right now. I am getting a CloudFlare branded 504 which means the origin server (Galaxy) gave a gateway timeout.
Running into this probably 4 out of 5 times.
Also running into this right now
Same here. Really holding up some testing when I have a go-live in 2 days! Eek!
Having the same problem here
Just made a duct-tape solution inspired by @kdelee's changes. To use it, install ansible like this:
python3 -m pip install https://github.com/WATonomous/ansible/archive/galaxy_timeout.tar.gz
Turns out if you retry enough times the install eventually succeeds 🤪.
Took a lot of tries to pinpoint where the retry is needed. The code is messy (I added a lot of debugging statements). Take a look at your own risk 😜.
This was the workaround I used to bypass this:
- name: Install ansible galaxy collections
  ansible.builtin.command:
    cmd: ansible-galaxy collection install "{{ item }}"
    creates: $HOME/.ansible/collections/ansible_collections/community/{docker,general,hashi_vault,mongodb,mysql}
  loop:
    - community.mysql
    - community.general
    - community.hashi_vault
    - community.docker
    - ansible.posix
    - community.mongodb
  register: install_ansible_collections
  retries: 10
  until: install_ansible_collections.rc == 0
This is failing often enough to cause significant CI breakage. Here's a Dockerfile retry loop to try and mitigate this:
ADD requirements.yml /tmp/
RUN \
for i in 5 4 3 2 1; do \
if ansible-galaxy collection install -p /usr/share/ansible/collections -r /tmp/requirements.yml; then \
break; \
elif [ $i -gt 1 ]; then \
sleep 10; \
else \
exit 1; \
fi; \
done \
&& ansible-galaxy collection list
I had no luck retrying in an infinite loop. It seems completely dead:
RUN until ansible-galaxy collection install \
community.molecule \
community.windows:==1.3.0 \
community.aws:==1.5.0; \
do \
echo "Galaxy failed. Try again"; \
done
Reminder that you can just install collections via git. Just make sure you check the galaxy.yml and also install the dependencies it lists.
An infinite retry loop with no back-off/delay or limit will presumably only make the situation worse.
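For anyone keeping the retry approach, a bounded loop with a doubling delay is kinder to an overloaded API than an infinite tight loop; a sketch (the `backoff()` helper is hypothetical):

```shell
# Hypothetical helper: bounded retries with exponential back-off.
backoff() {
  # usage: backoff <max_attempts> <first_delay_seconds> <command...>
  max=$1; delay=$2; shift 2
  i=1
  while ! "$@"; do
    [ "$i" -ge "$max" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))    # e.g. 5s, 10s, 20s, ...
    i=$((i + 1))
  done
}

# backoff 5 5 ansible-galaxy collection install -r requirements.yml
```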
We have the same issue; we have disabled the requirements file in our AWX.
> An infinite retry loop with no back-off/delay or limit will presumably only make the situation worse.
Well, yes and no. Your 1 sec just adds to the timeout. It's not like it's permanently hammering the API. In fact I used to have a sleep 1 there first, but dropped it.
At this point any attempt is making the situation worse. That's why I completely dropped Galaxy for now and install via git.
If anyone needs an example, this is my quick and dirty solution:
RUN mkdir -p /usr/share/ansible/ansible_collections/community \
/usr/share/ansible/ansible_collections/ansible \
/usr/share/ansible/ansible_collections/amazon && \
cd /usr/share/ansible/ansible_collections/community && \
git clone https://github.com/ansible-collections/community.molecule.git molecule && \
git clone https://github.com/ansible-collections/community.windows.git windows && \
cd windows && git checkout -q v1.3.0 && cd .. && \
git clone https://github.com/ansible-collections/community.aws.git aws && \
cd aws && git checkout -q 1.5.0 && cd .. && \
cd /usr/share/ansible/ansible_collections/ansible && \
git clone https://github.com/ansible-collections/ansible.windows.git windows && \
cd windows && git checkout -q 1.9.0 && cd .. && \
cd /usr/share/ansible/ansible_collections/amazon && \
git clone https://github.com/ansible-collections/amazon.aws.git aws && \
cd aws && git checkout -q 3.1.1
Any updates on this issue? Is this problem specific to any particular version of ansible?
Bug Report
SUMMARY
We've seen
ERROR! Unexpected Exception, this is probably a bug: ('The read operation timed out',)
(10-minute timeout) quite a few times. The size of the collection doesn't seem to be related. Is there any logging on Galaxy to see how common this is?
ACTUAL RESULTS
https://app.shippable.com/github/ansible-collections/community.general/runs/164/3/console