Terminal017 commented 1 month ago

Website URL

1. Domain: www.thenews.com.pk
2. Site being crawled: https://www.thenews.com.pk/tns/

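Error description

1. After a certain number of pages have been crawled, the site starts returning Bad Request errors (see screenshot), and from then on no page on the site can be accessed. In tests on my machine it seems to fail consistently after a little over 1,200 articles have been crawled. (Each URL in StartURLs yields roughly 200 news articles.)
2. I tried changing the order and number of the URLs being crawled, and I tried adding delays with time.sleep, but it still seems to fail after the same number of pages.
3. I would like to ask about possible causes and solutions.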

Code

package dev

import (
    "megaCrawler/crawlers"
    "megaCrawler/extractors"
    "strings"

    "github.com/gocolly/colly/v2"
)

func init() {
    // The site's main domain is www.thenews.com.pk, which only serves the current day's news,
    // so we crawl a dedicated news section inside the site instead.
    engine := crawlers.Register("1450", "国际新闻", "https://www.thenews.com.pk/tns/")

    engine.SetStartingURLs([]string{
        "https://www.thenews.com.pk/tns/category/interviews",
        "https://www.thenews.com.pk/tns/category/dialogue",
        "https://www.thenews.com.pk/tns/category/special-report",
        "https://www.thenews.com.pk/tns/category/art-culture",
        "https://www.thenews.com.pk/tns/category/literati",
        "https://www.thenews.com.pk/tns/category/footloose", // the most start URLs that ran without errors on my machine
        // "https://www.thenews.com.pk/tns/category/political-economy",
        // "https://www.thenews.com.pk/tns/category/sports",
        // "https://www.thenews.com.pk/tns/category/shehr",
        // "https://www.thenews.com.pk/tns/category/fashion",
        // "https://www.thenews.com.pk/tns/category/encore",
        // "https://www.thenews.com.pk/tns/category/instep",
        // "https://www.thenews.com.pk/tns/category/in-the-picture",
    })

    extractorConfig := extractors.Config{
        Author:       true,
        Image:        true,
        Language:     true,
        PublishDate:  true,
        Tags:         true,
        Text:         false,
        Title:        true,
        TextLanguage: "",
    }

    extractorConfig.Apply(engine)

    engine.OnHTML(".w_c_left > div > ul > li > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.News)
    })

    engine.OnHTML(`.pagination_category > a[rel = "next"]`, func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.Index)
    })

    engine.OnHTML(".authorFullName > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.Authors = append(ctx.Authors, strings.TrimSpace(element.Text))
    })

    engine.OnHTML("div.detail-time", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.PublicationTime = strings.TrimSpace(element.Text)
    })

    engine.OnHTML(".detail-desc > p", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.Content += element.Text
    })

}

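Checklist

[x] If this is a question about how to find an entry point, I have confirmed that the site does not have a sitemap.

Additional context

Screenshot 2024-10-13 190659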

foxwhite25 commented 1 month ago

I added exponential backoff.

foxwhite25 commented 1 month ago

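I figured out why. This site's cookie handling is a mess: every page sets one more cookie on the client, so the request eventually exceeds the HTTP header size limit. This can be solved with the newly added disableCookie.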

foxwhite25 / megaCrawler

About the Bad Request error #130
