foxwhite25 / megaCrawler

An auto-restarting scraper

About the Bad Request error #130

Closed · Terminal017 closed this 1 month ago

Terminal017 commented 1 month ago

Website URL

1. Domain: www.thenews.com.pk
2. Site being crawled: https://www.thenews.com.pk/tns/

Error description

1. After crawling a certain number of pages, this site starts returning Bad Request errors (see screenshot), after which every request to it fails. In my local tests it consistently breaks after roughly 1,200 articles (each URL in the starting list yields about 200 news articles).
2. I tried changing the order and number of starting URLs, and adding time.sleep delays (see the sketch below for colly's built-in alternative), but it still fails after roughly the same number of pages.
3. What could be causing this, and how can it be fixed?
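For reference, colly's built-in per-domain throttling is the usual alternative to manual time.sleep calls. A minimal standalone sketch, assuming direct use of colly rather than megaCrawler's engine wrapper; the delay values are arbitrary:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Wait 2s plus up to 1s of random jitter between requests to the
    // target domain, instead of sleeping manually inside handlers.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*thenews.com.pk*",
        Delay:       2 * time.Second,
        RandomDelay: time.Second,
    }); err != nil {
        fmt.Println("bad limit rule:", err)
    }

    _ = c.Visit("https://www.thenews.com.pk/tns/")
}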

Code

package dev

import (
    "megaCrawler/crawlers"
    "megaCrawler/extractors"
    "strings"

    "github.com/gocolly/colly/v2"
)

func init() {
    // The site's main domain, www.thenews.com.pk, only serves the current
    // day's news, so we crawl its dedicated news section instead.
    engine := crawlers.Register("1450", "国际新闻", "https://www.thenews.com.pk/tns/")

    engine.SetStartingURLs([]string{
        "https://www.thenews.com.pk/tns/category/interviews",
        "https://www.thenews.com.pk/tns/category/dialogue",
        "https://www.thenews.com.pk/tns/category/special-report",
        "https://www.thenews.com.pk/tns/category/art-culture",
        "https://www.thenews.com.pk/tns/category/literati",
        "https://www.thenews.com.pk/tns/category/footloose", // the largest set that still completes in my local tests
        // "https://www.thenews.com.pk/tns/category/political-economy",
        // "https://www.thenews.com.pk/tns/category/sports",
        // "https://www.thenews.com.pk/tns/category/shehr",
        // "https://www.thenews.com.pk/tns/category/fashion",
        // "https://www.thenews.com.pk/tns/category/encore",
        // "https://www.thenews.com.pk/tns/category/instep",
        // "https://www.thenews.com.pk/tns/category/in-the-picture",
    })

    extractorConfig := extractors.Config{
        Author:       true,
        Image:        true,
        Language:     true,
        PublishDate:  true,
        Tags:         true,
        Text:         false,
        Title:        true,
        TextLanguage: "",
    }

    extractorConfig.Apply(engine)

    // Queue article links found on category listing pages.
    engine.OnHTML(".w_c_left > div > ul > li > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.News)
    })

    // Follow the "next" link to paginate through each category.
    engine.OnHTML(`.pagination_category > a[rel="next"]`, func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.Index)
    })

    // Author names on article pages.
    engine.OnHTML(".authorFullName > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.Authors = append(ctx.Authors, strings.TrimSpace(element.Text))
    })

    // Publication time.
    engine.OnHTML("div.detail-time", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.PublicationTime = strings.TrimSpace(element.Text)
    })

    // Article body paragraphs.
    engine.OnHTML(".detail-desc > p", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.Content += element.Text
    })
}

Checklist

Additional context

[Screenshot: 2024-10-13 190659]

foxwhite25 commented 1 month ago

I've added exponential backoff.
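For context, exponential backoff retries a failed request after successively longer pauses. A minimal sketch of the idea; visitWithBackoff and its parameters are illustrative, not megaCrawler's actual implementation:

package main

import (
    "errors"
    "fmt"
    "time"
)

// visitWithBackoff retries fn with exponentially growing pauses:
// 1s, 2s, 4s, ... allowing up to maxRetries attempts after the first failure.
func visitWithBackoff(fn func() error, maxRetries int) error {
    delay := time.Second
    for attempt := 0; ; attempt++ {
        err := fn()
        if err == nil {
            return nil
        }
        if attempt >= maxRetries {
            return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
        }
        time.Sleep(delay)
        delay *= 2 // double the wait after each failure
    }
}

func main() {
    calls := 0
    err := visitWithBackoff(func() error {
        calls++
        if calls < 3 {
            return errors.New("400 Bad Request") // simulate a transient failure
        }
        return nil
    }, 5)
    fmt.Println("calls:", calls, "err:", err)
}

As the next comment shows, though, the 400s here were not transient, so backoff alone could not clear them.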

foxwhite25 commented 1 month ago

I figured out why. This site's cookie handling is a mess: every page sets yet another cookie, so requests eventually exceed the HTTP header size limit. The newly added disableCookie option fixes it.
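For reference, plain colly exposes Collector.DisableCookies(), which stops storing and replaying Set-Cookie headers so the Cookie request header can no longer grow without bound. A minimal sketch; whether megaCrawler's new disableCookie option maps to exactly this call is an assumption:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Stop storing Set-Cookie responses and replaying them on later
    // requests, so the Cookie header cannot accumulate across pages.
    c.DisableCookies()

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("fetched", r.Request.URL, "status:", r.StatusCode)
    })

    if err := c.Visit("https://www.thenews.com.pk/tns/"); err != nil {
        fmt.Println("visit failed:", err)
    }
}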