foxwhite25 / megaCrawler

An auto-restarting scraper

Crawling stops abruptly #120

Closed Homework-DAD closed 4 weeks ago

Homework-DAD commented 1 month ago

Website

China Daily

URL

https://www.chinadaily.com.cn/

Problem

I first tried crawling the domestic news. The URL is: https://www.chinadaily.com.cn/china/governmentandpolicy After collecting the second-to-last article on the first page, the crawler simply stops. Below are a screenshot of the site and a screen recording of the script running:

[Screenshot: China Daily – Domestic News]

https://github.com/user-attachments/assets/5bb038ff-d132-4928-8dd3-1f257daff62b

Script

package dev

import (
    "megaCrawler/crawlers"
    "megaCrawler/extractors"

    "github.com/gocolly/colly/v2"
)

func init() {
    engine := crawlers.Register("1723", "中国日报", "https://www.chinadaily.com.cn/")

    engine.SetStartingURLs([]string{"https://www.chinadaily.com.cn/china/governmentandpolicy/"})

    extractorConfig := extractors.Config{
        Author:       true,
        Image:        true,
        Language:     true,
        PublishDate:  true,
        Tags:         true,
        Text:         false,
        Title:        true,
        TextLanguage: "",
    }

    extractorConfig.Apply(engine)

    // Follow each article link on the index page and queue it as a News page.
    engine.OnHTML(".tw3_01_2_p > a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.News)
    })

    // Accumulate the article body text into the crawl context.
    engine.OnHTML(".lft_art", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        ctx.Content += element.Text
    })

    // Follow pagination links and queue them as Index pages.
    engine.OnHTML(".pagestyte", func(element *colly.HTMLElement, ctx *crawlers.Context) {
        engine.Visit(element.Attr("href"), crawlers.Index)
    })

}
foxwhite25 commented 4 weeks ago

Because your pagination selector is wrong.
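One way to read this: the pagination handler selects the `.pagestyte` element itself rather than a link inside it, so `element.Attr("href")` returns an empty string and no next page is ever queued. A minimal sketch of a corrected handler, assuming the container's real class is `.pagestyle` and the page links are `<a>` children (both the class name and the child selector are assumptions; verify them against the live page's HTML):

```go
// Hypothetical fix: ".pagestyle a" targets the <a> elements themselves,
// so Attr("href") yields the next-page link instead of an empty string.
// ".pagestyle" is an assumed class name -- confirm it in the page source.
engine.OnHTML(".pagestyle a", func(element *colly.HTMLElement, ctx *crawlers.Context) {
    engine.Visit(element.Attr("href"), crawlers.Index)
})
```

As a rule of thumb with colly, any handler that calls `Attr("href")` should be registered on a selector ending in `a` (or another element that actually carries the attribute), not on the surrounding container.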