foxwhite25 / megaCrawler

An auto-restart scraper

Website scraping, third question. #115

Closed. Homework-DAD closed this issue 1 month ago

Homework-DAD commented 1 month ago

Website

Social Security System (SSS)

URLs

Homepage: https://www.sss.gov.ph/
News page: https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases

Problem

Articles cannot be scraped. The output from running 1713.go is below:

2024-08-09T09:50:15.240+0800 INFO Running in terminal.
2024-08-09T09:50:15.244+0800 INFO I'm running windows-service.
2024-08-09T09:50:15.245+0800 INFO Listening on:7171
2024-08-09T09:50:15.245+0800 INFO Last scraper will start at 2024-08-09 09:51:18.2397023 +0800 CST m=+63.006208301
2024-08-09T09:50:18.243+0800 INFO Starting engine {"id": "1713"}
2024-08-09T09:50:18.245+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:28.261+0800 INFO Finished engine "1713"
2024-08-09T09:50:28.262+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:29.335+0800 DEBUG Website error tries 10 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-08-09T09:50:29.336+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:39.351+0800 DEBUG Website error tries 9 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:39.352+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:40.409+0800 DEBUG Website error tries 8 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-08-09T09:50:40.409+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:50.421+0800 DEBUG Website error tries 7 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:50.421+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:51.495+0800 DEBUG Website error tries 6 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2024-08-09T09:50:51.495+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:58.967+0800 DEBUG Website error tries 5 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:58.967+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:58.984+0800 DEBUG Website error tries 4 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:58.984+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:58.986+0800 DEBUG Website error tries 3 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:58.986+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:58.987+0800 DEBUG Website error tries 2 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:58.987+0800 DEBUG Visiting https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases
2024-08-09T09:50:58.987+0800 DEBUG Website error tries 1 for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host
2024-08-09T09:50:58.991+0800 ERROR Max retries exceed for https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases: Get "https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases": dial tcp: lookup www.sss.gov.ph: no such host

Attachment

package dev

import (
    "megaCrawler/crawlers"
    "megaCrawler/extractors"

    "github.com/gocolly/colly/v2"
)

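func init() {
    // Register the crawler under ID 1713. The name string "社会保障署"
    // means "Social Security System".
    engine := crawlers.Register("1713", "社会保障署", "https://www.sss.gov.ph/")

    // Start from the press release listing page.
    engine.SetStartingURLs([]string{"https://www.sss.gov.ph/sss/appmanager/viewArticle.jsp?page=pressreleases"})

    // Text is disabled here because the body text is collected by the
    // OnHTML handler at the bottom of this function.
    extractorConfig := extractors.Config{
        Author:       true,
        Image:        true,
        Language:     true,
        PublishDate:  true,
        Tags:         true,
        Text:         false,
        Title:        true,
        TextLanguage: "",
    }

    extractorConfig.Apply(engine)

    // Follow each article link on the listing page and treat it as news.
    engine.OnHTML(".Bold > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.News)
    })

    // Append every matching paragraph to the article content.
    engine.OnHTML(".mtp80 > tr > td > p", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.Content += element.Text
    })
}
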
foxwhite25 commented 1 month ago

This is a DNS problem on your side. Try using Google's DNS server, 8.8.8.8.

Homework-DAD commented 1 month ago

Below is the run log after changing the DNS setting.

DNS 1713-Log
Homework-DAD commented 1 month ago

On my side, the script still cannot access the page.