cncf / landscapeapp

🌄Upstream landscape generation application
https://landscapes.dev/
Apache License 2.0

Use curl to check homepage urls #584

Closed by jordinl 4 years ago

jordinl commented 4 years ago

We use Puppeteer to check that all homepage URLs work and to detect redirects. This process is extremely slow: I think it takes about 3 hours on the CNCF landscape. We could use curl instead; something like the code below takes under a minute:

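const { execFile } = require('child_process')

// Check one homepage URL with curl.
// Resolves to { success: true, effectiveUrl } or { success: false, status }.
const curl = url => {
    return new Promise(resolve => {
        // Arguments are passed as an array, so the URL needs no shell quoting.
        const curlArgs = [
            '--fail',
            '--location',
            '--silent',
            '--insecure',
            '--max-time', '20',
            '--output', '/dev/null',
            '-H', 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15)',
            '-H', 'Connection: keep-alive',
            '-H', 'Accept: text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8',
            // Print the final URL (after redirects) and the HTTP status as JSON.
            '--write-out', '{"effectiveUrl":"%{url_effective}","status":"%{http_code}"}',
            url
        ]
        execFile('curl', curlArgs, (error, stdout) => {
            let effectiveUrl, status
            try {
                ({ effectiveUrl, status } = JSON.parse(stdout))
            } catch (e) {
                // No parsable output, for example when curl could not be started.
                return resolve({ success: false, status: 'UNKNOWN' })
            }
            // A 403 is not counted as a failure: the site responded but refused the request.
            if (error && status !== '403') {
                resolve({ success: false, status: status || 'UNKNOWN' })
            } else {
                resolve({ effectiveUrl, success: true })
            }
        })
    })
}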
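Finishing in under a minute presumably depends on checking many URLs in parallel rather than one after another. The following is a minimal sketch of a concurrency-limited runner, not the code actually used for the landscape; the name `runWithConcurrency` and the limit of 20 in the usage comment are assumptions made for illustration.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
// Hypothetical helper: the name and signature are not part of landscapeapp.
const runWithConcurrency = async (items, limit, worker) => {
    const results = new Array(items.length)
    let next = 0
    // Each runner repeatedly claims the next unprocessed index until none are left.
    const runner = async () => {
        while (next < items.length) {
            const index = next++
            results[index] = await worker(items[index])
        }
    }
    const runners = Array.from({ length: Math.min(limit, items.length) }, runner)
    await Promise.all(runners)
    return results
}

// Hypothetical usage with the curl function from the comment above:
//   runWithConcurrency(urls, 20, curl).then(results => console.log(results))
```

Because the `curl` function above always resolves and never rejects, a single failing URL does not abort the rest of the batch.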

dankohn commented 4 years ago

The issue is that some sites do a JavaScript redirect instead of a 302. But why don't you run it against our current landscapes and see whether that still matters?
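To illustrate the concern: curl follows HTTP redirects such as a 302, but it does not execute JavaScript or act on a `<meta>` refresh tag, so those redirects are invisible to it. Below is a rough heuristic sketch for spotting two common client-side redirect patterns in a fetched HTML body. The function name `findClientRedirect` is hypothetical, and the regular expressions are assumptions that will miss many real-world cases (for example, attributes in a different order, or redirects built dynamically).

```javascript
// Heuristic only: looks for two common client-side redirect patterns in HTML.
// Returns { type, target } for the first match, or null if none is found.
const findClientRedirect = html => {
    // Matches e.g. <meta http-equiv="refresh" content="0; url=https://example.com/en">
    const meta = html.match(
        /<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']?\s*\d+\s*;\s*url=([^"'>\s]+)/i
    )
    if (meta) return { type: 'meta-refresh', target: meta[1] }

    // Matches e.g. window.location = "...", location.href = '...', location.replace("...")
    const js = html.match(
        /location(?:\.href)?\s*=\s*["']([^"']+)["']|location\.replace\(\s*["']([^"']+)["']\s*\)/
    )
    if (js) return { type: 'javascript', target: js[1] || js[2] }

    return null
}
```

A check like this could only flag candidates for a slower browser-based check; it cannot tell whether the script would actually run.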

jordinl commented 4 years ago

@dankohn Some of these JavaScript redirects are not that interesting. I'm not sure why we care that foo.com redirects to foo.com/en. Also, with JavaScript we might not be able to catch a 404 page.

I ran the checks with the curl command I pasted above, and it seems to get more accurate results than the daily run for CNCF. Note that there might be some discrepancies because some sites may redirect or block based on IP.

| URL | Puppeteer | Curl | Result |
|---|---|---|---|
| apex.run | GOOD | ERROR | Curl |
| harmonycloud.cn | harmonycloud.cn/overindex | GOOD | Puppeteer |
| cisco.com/go/containers | GOOD | cisco.com/c/en/us/products/cloud-systems-management/container-platform/index.html | Curl |
| harmonycloud.cn | harmonycloud.cn/overindex | GOOD | Puppeteer |
| osci.kr | osci.kr:443 | osci.kr:443/main.php | Curl |
| yanrongyun.com | yanrongyun.com/en-us | GOOD | Puppeteer |
| yanrongyun.com/solution/k8s-storage | GOOD | 404 | Curl |
| amadeus.com/en | 403 | GOOD | Curl |
| bitnami.com/stacksmith | GOOD | bitnami.com/tanzu-application-catalog | Curl |
| cloud.google.com/stackdriver | GOOD | cloud.google.com/products/operations | Curl |
| cloud.netapp.com/kubernetes-service | GOOD | cloud.netapp.com/project-astra | Curl |
| cloudzone.io | GOOD | ERROR | Curl |
| github.com/Comcast/trickster | GOOD | github.com/tricksterproxy/trickster | Curl |
| github.com/vmware-tanzu/gimbal | GOOD | github.com/projectcontour/gimbal | Curl |
| inlets.dev | GOOD | docs.inlets.dev | Curl |
| jet.hazelcast.org | GOOD | jet-start.sh | Curl |
| juxt.pro/crux/index.html | GOOD | opencrux.com | Curl |
| mae.sh | GOOD | containo.us/maesh | Curl |
| miaoyun.io | GOOD | miaoyun.net.cn | Curl |
| msystechnologies.com/cloud-coe | GOOD | msystechnologies.com | Browser shows 404 page |
| nks.netapp.io | GOOD | ERROR | Curl |
| platformer.com/services/training/kubernetes | TIMEOUT | ERROR | Both |
| redkubes.com/services | GOOD | 404 | Curl |
| robin.io/product/hyper-converged-kubernetes | GOOD | robin.io/news/hyper-converged-kubernetes-platform-2 | robin.io/resources/robin-platform-two-minute-video |
| spotinst.com/products/spotinst-functions | GOOD | spot.io/products/spotinst-functions | Curl |
| sysdig.com/products/secure | GOOD | sysdig.com/products/kubernetes-security | Curl |
| traefik.io | GOOD | containo.us/traefik | Curl |
| twitter.com | twitter.com/explore | GOOD | Puppeteer |
| vexxhost.com/public-cloud/container-services-kubernetes | GOOD | vexxhost.com/public-cloud/container-services-certified-kubernetes | Curl |
| 163yun.com | 163yun.com/product-nsf | GOOD | Puppeteer |
| 2ndquadrant.com/en/services/kubernetes-orchestration-for-highly-available-postgresql-and-bdr | GOOD | 2ndquadrant.com/en/resources/kubernetes-operators-for-highly-available-postgresql-and-bdr | Curl |
| adidas.com/us | 403 | GOOD | Curl |
| altoros.com/kubernetes-consulting | GOOD | altoros.com/services/kubernetes-consulting | Curl |
| aporeto.com | GOOD | paloaltonetworks.com | Curl |
| binaris.com | GOOD | reshuffle.com/index.html | Curl |
| chef.io | chef.io/home | GOOD | Curl |
| cloudbees.com/products/cloudbees-codeship | GOOD | cloudbees.com/products/codeship/overview | Curl |
| cloudflare.com/products/cloudflare-workers | GOOD | workers.cloudflare.com | Curl |
| clyso.com | clyso.com/en | GOOD | Puppeteer |
| curve.app/en | curve.com/en | curve.com | Curl |
| dellemc.com/en-us/services/consulting-services/cloud-native-applications.htm | GOOD | delltechnologies.com/en-us/services/consulting-services/cloud-native-applications.htm | Curl |
| dellemc.com/en-us/storage/data-storage.htm | GOOD | delltechnologies.com/en-us/storage/data-storage.htm | Curl |
| desotech.it/formazione-2/devops-2 | deso.tech/formazione-2/devops-2 | 404 | Curl |
| elastic.co/products/apm | GOOD | elastic.co/apm | Curl |
| elastic.co/products/beats | GOOD | elastic.co/beats | Curl |
| elastic.co/products/logstash | GOOD | elastic.co/logstash | Curl |
| fidelity.com | 403 | GOOD | Curl |
| godaddy.com | GOOD | es.godaddy.com | Puppeteer |
| guardicore.com/workload-protection-hybrid-cloud | GOOD | guardicore.com/cloud-security-platform | 404 |
| hedvig.io | GOOD | commvault.com/software-defined-storage | Curl |
| ibm.com/cloud/kubernetes-service | GOOD | ibm.com/cloud/container-service | Curl |
| iherb.com | hk.iherb.com | es.iherb.com | Puppeteer |
| indeed.com | hk.indeed.com/?r=us | es.indeed.com/?r=us | Puppeteer |
| infinidat.com | GOOD | infinidat.com/en | Curl |
| infracloud.io | GOOD | ERROR | Puppeteer |
| irondb.io | GOOD | circonus.com | Curl |
| ksyun.com/post/product/KCE | GOOD | ksyun.com/post/product/KCE.html | Curl |
| mirantis.com/software/kubernetes | GOOD | mirantis.com/software/mcp/kubernetes | Curl |
| nasdaq.com | 403 | GOOD | Curl |
| opensds.io | GOOD | sodafoundation.io | Curl |
| oracle.com/linux/cloudnative | GOOD | oracle.com/it-infrastructure/software.html | Curl |
| ovh.ie/public-cloud/kubernetes | GOOD | ovhcloud.com/en-ie/public-cloud/kubernetes | Curl |
| pachyderm.io | GOOD | pachyderm.com | Curl |
| paloaltonetworks.com/cloud-security | GOOD | paloaltonetworks.com/prisma/cloud | Curl |
| skybet.com | m.skybet.com/failover/blocking_page.html | 403 | Both |
| talend.com/products/data-streams-free-edition | GOOD | 404 | Curl |
| underarmour.com/en-us | 403 | GOOD | Curl |
| unitedhealthgroup.com | TIMEOUT | GOOD | Curl |
| uswitch.com | ERROR | GOOD | Curl |
| verizonmedia.com | GOOD | consent.yahoo.com/collectConsent | Puppeteer |
| vmware.com | vmware.com/hk.html | GOOD | Curl |
| zalando.de | en.zalando.de/?_rfl=de | GOOD | Puppeteer |
| ogis-ri.co.jp | INVALID | GOOD | Curl |
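
A comparison like the one above could be produced along the following lines. This is a sketch under stated assumptions, not the script that generated the table: it assumes both checkers return objects shaped like the curl function's result (`{ success, effectiveUrl, status }`), and the helper names `normalizeUrl`, `summarize`, and `compareResults` are hypothetical. The normalization (dropping the scheme, a leading `www.`, and a trailing slash) is a guess based on how URLs appear in the table.

```javascript
// Strip scheme, leading "www." and trailing slash, so that
// "https://www.chef.io/" and "chef.io" compare as equal.
const normalizeUrl = url =>
    url.replace(/^https?:\/\//i, '').replace(/^www\./i, '').replace(/\/$/, '')

// Summarize one check result relative to the URL that was checked:
// 'GOOD' (no redirect), the normalized redirect target, or the failure status.
const summarize = (originalUrl, result) => {
    if (!result.success) return String(result.status)
    const target = normalizeUrl(result.effectiveUrl)
    return target === normalizeUrl(originalUrl) ? 'GOOD' : target
}

// Return a row only when the two tools disagree; otherwise null.
const compareResults = (url, puppeteerResult, curlResult) => {
    const puppeteer = summarize(url, puppeteerResult)
    const curl = summarize(url, curlResult)
    return puppeteer === curl ? null : { url: normalizeUrl(url), puppeteer, curl }
}
```

Deciding which tool is right for each row (the last column) still requires opening the site by hand, since neither result is authoritative on its own.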