foxwhite25 / megaCrawler

An auto-restarting scraper

Website scraping, second question. #114

Closed: Homework-DAD closed this issue 3 months ago

Homework-DAD commented 3 months ago

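Websites

National Anti-Poverty Commission (NAPC); National Commission for Culture and the Arts (NCCA)

URLs

NAPC: https://napc.gov.ph/
NCCA: https://www.ncca.gov.ph/

Site sitemap XML

NAPC: https://napc.gov.ph/post-sitemap.xml
NCCA: None

Problem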

The site has a human-verification check, and the script is denied access (the output below is from running 1709.go):

2024-08-09T09:33:39.928+0800 INFO Running in terminal.
2024-08-09T09:33:39.934+0800 INFO I'm running windows-service.
2024-08-09T09:33:39.934+0800 INFO Listening on:7171
2024-08-09T09:33:39.935+0800 INFO Last scraper will start at 2024-08-09 09:34:42.927693 +0800 CST m=+63.011845901
2024-08-09T09:33:42.931+0800 INFO Starting engine {"id": "1709"}
2024-08-09T09:33:42.931+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:43.555+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:43.637+0800 DEBUG Website error tries 10 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:43.637+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:43.709+0800 DEBUG Website error tries 9 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:43.717+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:43.798+0800 DEBUG Website error tries 8 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:43.798+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:43.892+0800 DEBUG Website error tries 7 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:43.892+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:43.975+0800 DEBUG Website error tries 6 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:43.975+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:44.103+0800 DEBUG Website error tries 5 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:44.103+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:44.206+0800 DEBUG Website error tries 4 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:44.206+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:44.286+0800 DEBUG Website error tries 3 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:44.286+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:44.370+0800 DEBUG Website error tries 2 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:44.371+0800 DEBUG Visiting https://www.ncca.gov.ph/
2024-08-09T09:33:44.451+0800 DEBUG Website error tries 1 for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:44.451+0800 ERROR Max retries exceed for https://www.ncca.gov.ph/: Forbidden
2024-08-09T09:33:47.944+0800 INFO Finished engine "1709"

Attachments

package dev

import (
    "megaCrawler/crawlers"
    "megaCrawler/extractors"
    "strings"

    "github.com/gocolly/colly/v2"
)

func init() {
    engine := crawlers.Register("1708", "国家反贫困委员会", "https://napc.gov.ph/")

    engine.SetStartingURLs([]string{"https://napc.gov.ph/post-sitemap.xml"})

    extractorConfig := extractors.Config{
        Author:       true,
        Image:        true,
        Language:     true,
        PublishDate:  true,
        Tags:         true,
        Text:         true,
        Title:        true,
        TextLanguage: "",
    }

    extractorConfig.Apply(engine)

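    // Visit every sitemap <loc> as a news page, except request-for-quotation
    // notices. The check uses element.Text (the URL inside <loc>); ctx.URL is
    // presumably the sitemap's own URL, which would never match the filter.
    engine.OnXML("//loc", func(element *colly.XMLElement, ctx *crawlers.Context) {
        if !strings.Contains(element.Text, "request-for-quotation") {
            engine.Visit(element.Text, crawlers.News)
        }
    })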

}
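
The sitemap filter in the 1708 crawler can be tried without the crawler framework. The sketch below parses a sitemap with the standard library and drops every `<loc>` that contains a given substring. `filterLocs` and `urlSet` are hypothetical names, and the two sample URLs are invented for illustration; they are not taken from the real NAPC sitemap.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// urlSet mirrors the <urlset><url><loc> layout of a standard sitemap.
type urlSet struct {
	Locs []string `xml:"url>loc"`
}

// filterLocs returns every <loc> URL in the sitemap that does not contain skip.
func filterLocs(sitemap []byte, skip string) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(sitemap, &set); err != nil {
		return nil, err
	}
	var keep []string
	for _, loc := range set.Locs {
		loc = strings.TrimSpace(loc)
		if !strings.Contains(loc, skip) {
			keep = append(keep, loc)
		}
	}
	return keep, nil
}

// sampleSitemap is made up for this example; the paths are not real NAPC URLs.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://napc.gov.ph/example-news-post/</loc></url>
  <url><loc>https://napc.gov.ph/request-for-quotation-example/</loc></url>
</urlset>`

func main() {
	locs, err := filterLocs([]byte(sampleSitemap), "request-for-quotation")
	if err != nil {
		panic(err)
	}
	for _, loc := range locs {
		fmt.Println(loc)
	}
}
```

The point of the sketch is that the filter has to be applied to the URL found inside each `<loc>` element, not to the URL of the sitemap document being parsed.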

package dev

import (
    "megaCrawler/crawlers"
    "megaCrawler/extractors"

    "github.com/gocolly/colly/v2"
)

func init() {
    engine := crawlers.Register("1709", "国家文化和艺术委员会", "https://www.ncca.gov.ph/")

    engine.SetStartingURLs([]string{"https://www.ncca.gov.ph/"})

    extractorConfig := extractors.Config{
        Author:       true,
        Image:        true,
        Language:     true,
        PublishDate:  true,
        Tags:         true,
        Text:         true,
        Title:        true,
        TextLanguage: "",
    }

    extractorConfig.Apply(engine)

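    // Follow each ".moretag" link (presumably the "read more" link) to an article page.
    engine.OnHTML(".moretag", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.News)
    })

    // Follow the "previous" pagination link to the next index page.
    engine.OnHTML(".nav-previous > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.Index)
    })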

}
Homework-DAD commented 3 months ago

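Attached are the human-verification pages shown when visiting the sites in a browser.

[Image: human verification 1] [Image: human verification 2]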
foxwhite25 commented 3 months ago

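The first page is behind Cloudflare, so you can skip it; just move it to error. The whole point of anti-scraping measures is to make the cost too high to be worth crawling. The second page is itself broken, so it doesn't need handling either.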