dragonflyoss / Dragonfly

This repository has been archived and moved to the new repository https://github.com/dragonflyoss/Dragonfly2.
https://d7y.io
Apache License 2.0
6k stars, 774 forks

failed to get file length from http client i/o timeout #1037

Open zhujian7 opened 5 years ago

zhujian7 commented 5 years ago

Ⅰ. Issue Description

supernode failed to get file length from http client:

2019-10-29 03:43:13.453 INFO sign:8 : success to init local ip of supernode, use ip: 122.168.3.213
2019-10-29 03:43:13.453 INFO sign:8 : start to run supernode
2019-10-29 03:43:40.414 INFO sign:8 : success to register peer &{IP:122.168.3.203 HostName:kube-master-3 Port:0 Version:0.4.3}
2019-10-29 03:43:40.616 INFO sign:8 : success to register peer &{IP:122.168.3.203 HostName:kube-master-3 Port:0 Version:0.4.3}
2019-10-29 03:43:40.810 INFO sign:8 : success to register peer &{IP:122.168.3.203 HostName:kube-master-3 Port:0 Version:0.4.3}
2019-10-29 03:43:41.110 INFO sign:8 : success to register peer &{IP:122.168.3.203 HostName:kube-master-3 Port:0 Version:0.4.3}
2019-10-29 03:43:43.416 ERRO sign:8 : failed to get file length from http client for taskID(fc4ace4d1e109d742a7c3de06d5c0dd768a885022fc23fac095c742cf239e457): failed to get http file Length: Get https://test1.caicloudprivatetest.com/v2/library/nginx/blobs/sha256:faa42fe99fd154460cd5f2174e74b0b004de5a139b7764a990a872f650dc996f: dial tcp: i/o timeout: {"Code":10,"Msg":"unknow error"}
2019-10-29 03:43:43.416 INFO sign:8 : failed to add or update task with req &{CID:122.168.3.203-186-1572320619.311 CallSystem: Dfdaemon:true Filter:[] Headers:map[Authorization:Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsImtpZCI6IlY0TEg6TkpIVDpSQ1FMOkRUUUg6VVBJQjpOREFJOk5EVlg6VkZNMjo2NURQOlQ1Q086TjZBRTpHTUhPIn0.eyJpc3MiOiJoYXJib3ItdG9rZW4taXNzdWVyIiwic3ViIjoiIiwiYXVkIjoiaGFyYm9yLXJlZ2lzdHJ5IiwiZXhwIjoxNTcyMzIyNDE3LCJuYmYiOjE1NzIzMjA2MTcsImlhdCI6MTU3MjMyMDYxNywianRpIjoicTdyWjROcXJRRVdKRmJNQSIsImFjY2VzcyI6W3sidHlwZSI6InJlcG9zaXRvcnkiLCJuYW1lIjoibGlicmFyeS9uZ2lueCIsImFjdGlvbnMiOlsicHVsbCJdfV19.s5VJJOQvWcVFAt-l3n9PV3SJWZT7hnd014a-8XJJrHRPAULWHhUZiOmdU1XojvUxQx1chuOktXi1M3t81y8-QqoYpgfBHjQn-n7hVp4--v8wiSfxvzVa30sqv42bIEaZ8iZPQMEfuY0m6F4u-1hcIuov6I5CyJCJOsx231LL_aZu97Bd5fHGYx2qJJzCjQ7dtJ7wXIIZgV5Mjp6lomVjIl086rldecCL7OXCsFt_jh3D4LfezTf9GJLneieKKZqxa0CAhwSDQOIyPErjaHhLlJrFGCaCOxCwj20QQD7ZAx69ah8wodgjdnzHwnaWbeQC4B4Sukbc-sfICFrAK3JCd4VoIrwvx4QHibcbT6ZUF8N-FCKgvujWa07KXF96ASYKLCilNWiKQMtffTW2URaPcEccOrYRIMbQIpWa4OXIX-nIHvAnAYNuHQf3ywxS-nwRfjlIVhL1p5I86xFTDYts_k0Mt4G8nbna-dGB1-dSmH9C7hJL1JKltIYII7JL8kJdMYCnKiRiGdqTNEu7V7gcfLX7Y1LpwjH6uFyDP-Sso8Uxwlwmh-5NEqDatR_n18ut6d44fTQZ0nkODdBsOVGkMpFqN1e3SoOUv1hClmrXjilvttKBPXVSgPKUP8w6hAByT7Hcgo5tHOSTbcnC-z40ybdWtvLANAfY4du4jhbpZW4 User-Agent:docker/18.09.5 go/go1.10.8 git-commit/e8ff056 kernel/3.10.0-862.11.6.1.el7.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.5 \(linux\)) X-Forwarded-For:127.0.0.1] Identifier: Md5: Path:/peer/file/471acf92-1404-458c-b5ea-9d2024d9971d-186-1572320619.311 PeerID:kube-master-3-122.168.3.203-1572320620414903266 RawURL:https://test1.caicloudprivatetest.com/v2/library/nginx/blobs/sha256:faa42fe99fd154460cd5f2174e74b0b004de5a139b7764a990a872f650dc996f SupernodeIP:122.168.3.213 TaskURL:https://test1.caicloudprivatetest.com/v2/library/nginx/blobs/sha256:faa42fe99fd154460cd5f2174e74b0b004de5a139b7764a990a872f650dc996f}: failed to get http file Length: Get 
https://test1.caicloudprivatetest.com/v2/library/nginx/blobs/sha256:faa42fe99fd154460cd5f2174e74b0b004de5a139b7764a990a872f650dc996f: dial tcp: i/o timeout: {"Code":10,"Msg":"unknow error"}

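But the registry domain is reachable with wget from inside the supernode container. The name resolves and the connection is attempted within about 30 ms; the only failure is certificate verification: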

bash-4.4# time wget https://test1.caicloudprivatetest.com/v2/library/nginx/blobs/sha256:faa42fe99fd154460cd5f2174e74b0b004de5a139b7764a990a872f650dc996f
Connecting to test1.caicloudprivatetest.com (122.168.3.218:443)
ssl_client: test1.caicloudprivatetest.com: certificate verification failed: self signed certificate in certificate chain
wget: error getting response: Connection reset by peer

real    0m0.030s
user    0m0.001s
sys     0m0.002s
bash-4.4# 

I found the timeout is set to 4s:

    // send request
    resp, err := HTTPGetTimeout(url, headers, 4*time.Second)
    if err != nil {
        return 0, 0, err
    }

I am confused about why this times out.

yeya24 commented 5 years ago

@zhujian7 Did this error happen multiple times or just happen once?

zhujian7 commented 5 years ago

Did this error happen multiple times or just happen once?

@yeya24 multiple times.

starnop commented 5 years ago

We have fixed that in the master branch. Could you please try again with the master branch? THX

zhujian7 commented 5 years ago

@Starnop I tried with the master branch, but the error still occurs. Could you please tell me which PR fixed this problem?

gzchen008 commented 4 years ago

I have the same problem

zhujian7 commented 4 years ago

@cgzchen which version did you use? Did you solve the problem?

zhujian7 commented 4 years ago

Some additional information: the environment I tested in cannot connect to the public network, and the /etc/resolv.conf in the supernode container is:

# cat /etc/resolv.conf

search localdomain

nameserver 8.8.8.8
nameserver 8.8.4.4

and /etc/hosts holds:

....
122.168.3.218 test1.caicloudprivatetest.com
....

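From this I concluded that name resolution in the supernode was going to the nameservers in /etc/resolv.conf, which are unreachable from this environment, rather than to the /etc/hosts entry. So I emptied /etc/resolv.conf, and the supernode can now get the file length normally.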

One remaining question: why does name resolution in the supernode process behave differently from a manual wget?

cc @cgzchen @Starnop

zhujian7 commented 4 years ago

Adding the following line to the supernode Dockerfile and rebuilding the supernode image solved my problem:

    RUN test -e /etc/nsswitch.conf || echo 'hosts: files dns' > /etc/nsswitch.conf

zhujian7 commented 4 years ago

/close