[x] I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
[x] I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.
Related issues:
add them here
Describe the bug
The ignore_regex configuration option in config.cfg seems to be ignored. URLs that match the specified regex are still being downloaded.
To Reproduce
Set ignore_regex in config.cfg to ^.*video.*$|^.*mediathek.*$
Add http://welt.de to sitelist.hjson
Run news-please and observe that URLs containing video and mediathek are still downloaded
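To rule out a mistake in the pattern itself, here is a minimal sketch that checks the regex against the URLs from the log below. It only assumes that the pattern is applied to the full URL string with Python's re module; how news-please applies ignore_regex internally is not something I verified. The helper name is_ignored is my own and not part of news-please.

```python
import re

# Pattern exactly as set in config.cfg for this report.
IGNORE_REGEX = r"^.*video.*$|^.*mediathek.*$"

# URLs copied from the "Checking site" lines in the log.
LOGGED_URLS = [
    "https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html",
    "https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055609/Extreme-Konstruktionen-Spektakulaere-Bauwerke.html",
    "https://www.welt.de/mediathek/dokumentation/gesellschaft/sendung192112689/Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien.html",
]


def is_ignored(url, pattern=IGNORE_REGEX):
    """Hypothetical helper: True if the URL matches the ignore pattern."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for url in LOGGED_URLS:
        # Each logged URL is expected to match, i.e. it should have been skipped.
        print(is_ignored(url), url)
```

All three logged URLs match the pattern in this check, so the regex itself does not look like the problem.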
Expected behavior
According to line 64 of config.cfg:
urls which match the following regex are ignored for recursive crawling
Log
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/vermischtes_video192762855_Beliebt-gegen-den-Kater-Gurkensaft-Verkaufsschlager-in-New-York_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/vermischtes_video192762855_Beliebt-gegen-den-Kater-Gurkensaft-Verkaufsschlager-in-New-York_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055601_Extreme-Phaenomene-Die-Macht-der-Natur_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055601_Extreme-Phaenomene-Die-Macht-der-Natur_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055609/Extreme-Konstruktionen-Spektakulaere-Bauwerke.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055609_Extreme-Konstruktionen-Spektakulaere-Bauwerke_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055609_Extreme-Konstruktionen-Spektakulaere-Bauwerke_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/gesellschaft/sendung192112689/Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_gesellschaft_sendung192112689_Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_gesellschaft_sendung192112689_Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien_1630695592.html.json
Versions (please complete the following information):
OS: Ubuntu 20.04
Python Version 3.8
news-please Version 1.5.21
Intent (optional; we'll use this info to prioritize upcoming tasks to work on)
[x] personal
[ ] academic
[ ] business
[ ] other
Some information on your project: Private playing around
See this regex validation for example welt.de URLs
Btw great project!