fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

ignore_regex configuration option in config.cfg is not working properly #217

Closed marvingabler closed 4 months ago

marvingabler commented 3 years ago

Describe the bug

The ignore_regex configuration option in config.cfg appears to have no effect: URLs that match the configured regex are still downloaded.

To Reproduce

  1. Set ignore_regex in config.cfg to ^.*video.*$|^.*mediathek.*$
  2. Add http://welt.de to sitelist.hjson
  3. Run news-please and observe that URLs containing video or mediathek are still downloaded

See this regex validation for example welt.de URLs
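The pattern can also be checked locally. The following is a minimal sketch, assuming the regex is applied to the full URL with Python's `re.match`; the helper name `is_ignored` is hypothetical and not part of news-please. The sample URLs are taken from the log in this issue.

```python
import re

# Pattern from step 1 of "To Reproduce".
IGNORE_REGEX = r"^.*video.*$|^.*mediathek.*$"

# URLs copied from the "Checking site" lines in the log of this issue.
SAMPLE_URLS = [
    "https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html",
    "https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055609/Extreme-Konstruktionen-Spektakulaere-Bauwerke.html",
    "https://www.welt.de/mediathek/dokumentation/gesellschaft/sendung192112689/Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien.html",
]


def is_ignored(url, pattern=IGNORE_REGEX):
    """Hypothetical helper: True if the URL matches the ignore pattern."""
    return re.match(pattern, url) is not None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(is_ignored(url), url)
```

If every sample URL is reported as matching, the pattern itself is not the problem, which points at how (or whether) the crawler applies it.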

Expected behavior

According to the comment on line 64 of config.cfg, "urls which match the following regex are ignored for recursive crawling", so URLs matching the pattern should not be downloaded.
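To make the expectation concrete, here is a rough sketch of the filtering I would expect before links are followed. This is not the news-please implementation; `filter_links` is a hypothetical function, and treating an empty pattern as "ignore nothing" is my assumption.

```python
import re
from typing import Iterable, List


def filter_links(urls: Iterable[str], ignore_regex: str) -> List[str]:
    """Hypothetical filter: drop every URL that matches ignore_regex.

    Assumption: an empty ignore_regex means no URL is ignored.
    """
    if not ignore_regex:
        return list(urls)
    pattern = re.compile(ignore_regex)
    return [url for url in urls if pattern.match(url) is None]


if __name__ == "__main__":
    candidates = [
        "http://welt.de",
        "https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html",
    ]
    # Only URLs that do not match the pattern should remain for crawling.
    print(filter_links(candidates, r"^.*video.*$|^.*mediathek.*$"))
```

With the configuration from "To Reproduce", only http://welt.de should survive this filter, while the mediathek URL should be dropped.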

Log

[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/vermischtes_video192762855_Beliebt-gegen-den-Kater-Gurkensaft-Verkaufsschlager-in-New-York_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/vermischtes_video192762855_Beliebt-gegen-den-Kater-Gurkensaft-Verkaufsschlager-in-New-York_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055601_Extreme-Phaenomene-Die-Macht-der-Natur_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055601_Extreme-Phaenomene-Die-Macht-der-Natur_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055609/Extreme-Konstruktionen-Spektakulaere-Bauwerke.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055609_Extreme-Konstruktionen-Spektakulaere-Bauwerke_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055609_Extreme-Konstruktionen-Spektakulaere-Bauwerke_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/gesellschaft/sendung192112689/Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_gesellschaft_sendung192112689_Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_gesellschaft_sendung192112689_Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien_1630695592.html.json

Versions (please complete the following information):

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

Btw great project!

flatplate commented 3 years ago

Which crawler are you using?