Closed (Homework-DAD closed this 3 months ago)
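Website: National Anti-Poverty Commission (NAPC); National Commission for Culture and the Arts (NCCA)
URL: https://napc.gov.ph/ (NAPC); https://www.ncca.gov.ph/ (NCCA)
Sitemap XML: https://napc.gov.ph/post-sitemap.xml (NAPC); None (NCCA)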
Problem: the site uses human verification and the script is denied access (the output below is from running 1709.go).

    2024-08-09T09:33:39.928+0800 INFO Running in terminal.
    2024-08-09T09:33:39.934+0800 INFO I'm running windows-service.
    2024-08-09T09:33:39.934+0800 INFO Listening on:7171
    2024-08-09T09:33:39.935+0800 INFO Last scraper will start at 2024-08-09 09:34:42.927693 +0800 CST m=+63.011845901
    2024-08-09T09:33:42.931+0800 INFO Starting engine {"id": "1709"}
    2024-08-09T09:33:42.931+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:43.555+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:43.637+0800 DEBUG Website error tries 10 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:43.637+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:43.709+0800 DEBUG Website error tries 9 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:43.717+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:43.798+0800 DEBUG Website error tries 8 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:43.798+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:43.892+0800 DEBUG Website error tries 7 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:43.892+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:43.975+0800 DEBUG Website error tries 6 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:43.975+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:44.103+0800 DEBUG Website error tries 5 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:44.103+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:44.206+0800 DEBUG Website error tries 4 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:44.206+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:44.286+0800 DEBUG Website error tries 3 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:44.286+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:44.370+0800 DEBUG Website error tries 2 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:44.371+0800 DEBUG Visiting https://www.ncca.gov.ph/
    2024-08-09T09:33:44.451+0800 DEBUG Website error tries 1 for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:44.451+0800 ERROR Max retries exceed for https://www.ncca.gov.ph/: Forbidden
    2024-08-09T09:33:47.944+0800 INFO Finished engine "1709"
Crawler for engine 1708 (https://napc.gov.ph/):

    package dev

    import (
        "megaCrawler/crawlers"
        "megaCrawler/extractors"
        "strings"

        "github.com/gocolly/colly/v2"
    )

    func init() {
        engine := crawlers.Register("1708", "国家反贫困委员会", "https://napc.gov.ph/")

        // Start from the post sitemap rather than the home page.
        engine.SetStartingURLs([]string{"https://napc.gov.ph/post-sitemap.xml"})

        extractorConfig := extractors.Config{
            Author:       true,
            Image:        true,
            Language:     true,
            PublishDate:  true,
            Tags:         true,
            Text:         true,
            Title:        true,
            TextLanguage: "",
        }

        extractorConfig.Apply(engine)

        // Visit every <loc> entry in the sitemap as a news page, skipping
        // request-for-quotation posts. The check is made on the entry's own
        // URL (element.Text), since that is the URL being queued.
        engine.OnXML("//loc", func(element *colly.XMLElement, ctx *crawlers.Context) {
            if !strings.Contains(element.Text, "request-for-quotation") {
                engine.Visit(element.Text, crawlers.News)
            }
        })
    }
Crawler for engine 1709 (https://www.ncca.gov.ph/):

    package dev

    import (
        "megaCrawler/crawlers"
        "megaCrawler/extractors"

        "github.com/gocolly/colly/v2"
    )

    func init() {
        engine := crawlers.Register("1709", "国家文化和艺术委员会", "https://www.ncca.gov.ph/")

        engine.SetStartingURLs([]string{"https://www.ncca.gov.ph/"})

        extractorConfig := extractors.Config{
            Author:       true,
            Image:        true,
            Language:     true,
            PublishDate:  true,
            Tags:         true,
            Text:         true,
            Title:        true,
            TextLanguage: "",
        }

        extractorConfig.Apply(engine)

        // "Read more" links lead to individual news articles.
        engine.OnHTML(".moretag", func(element *colly.HTMLElement, ctx *crawlers.Context) {
            engine.Visit(element.Attr("href"), crawlers.News)
        })

        // The "previous" navigation link leads to the next index page.
        engine.OnHTML(".nav-previous > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
            engine.Visit(element.Attr("href"), crawlers.Index)
        })
    }
Attached: the human-verification page shown when opening the site in a browser.
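The first page is behind Cloudflare, so you can skip it; just file it under error. The whole point of anti-scraping measures is to make crawling too costly to be worth it.
The second page is itself down, so it doesn't need handling either.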