anfranken / news-scrap


[WIP] [Close #5] Set up sitelist.hjson #8

Closed (afuetterer closed this 2 years ago)

afuetterer commented 4 years ago

sitelist.hjson now contains an entry for "spiegel.de" so that settings can be applied per site. config.cfg is an unmodified copy of the original file that ships with news-please.

Related issues: #5, #9

afuetterer commented 4 years ago

I tried to use the ignore_regex setting without success.

I wanted to ignore URL patterns like these:

https://www.spiegel.de/dienste/
https://www.spiegel.de/extra/

I tried each of the following, one at a time:

"ignore_regex" : "dienste"
"ignore_regex" : "extra"
"ignore_regex" : "(\/dienste\/)|(\/extra\/)"

In every case, URLs that match the regex are still being crawled.
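To rule out the patterns themselves, they can be checked in isolation with Python's `re` module. The sketch below is only a diagnostic and is not taken from news-please: the helper names `is_ignored_search` and `is_ignored_match`, the `ANCHORED_PATTERN` value, and the extra `/politik/` URL are hypothetical. It rests on one unconfirmed assumption worth testing, namely that the crawler might apply `ignore_regex` with start-anchored matching (`re.match`) rather than substring matching (`re.search`). If that is the case, a bare pattern such as `dienste` would never match a full URL that begins with `https://`.

```python
import re

# URLs from the comment above, plus one that should NOT be ignored
# (the /politik/ URL is a made-up control case).
urls = [
    "https://www.spiegel.de/dienste/",
    "https://www.spiegel.de/extra/",
    "https://www.spiegel.de/politik/",
]

# The three patterns that were tried in sitelist.hjson.
patterns = ["dienste", "extra", r"(\/dienste\/)|(\/extra\/)"]

# Hypothetical alternative: a pattern that matches from the start of the URL.
ANCHORED_PATTERN = r"https?://www\.spiegel\.de/(dienste|extra)/"


def is_ignored_search(pattern: str, url: str) -> bool:
    """True if the pattern occurs anywhere in the URL (re.search)."""
    return re.search(pattern, url) is not None


def is_ignored_match(pattern: str, url: str) -> bool:
    """True only if the pattern matches at the start of the URL (re.match)."""
    return re.match(pattern, url) is not None


if __name__ == "__main__":
    for pattern in patterns + [ANCHORED_PATTERN]:
        for url in urls:
            print(
                f"{pattern!r:45} {url:35} "
                f"search={is_ignored_search(pattern, url)} "
                f"match={is_ignored_match(pattern, url)}"
            )
```

If the `search` column is true but the `match` column is false for the original patterns, a pattern that covers the URL from its beginning (such as the hypothetical `ANCHORED_PATTERN`, or one prefixed with `.*`) may be worth trying in `ignore_regex`. Whether this is the actual cause depends on how the installed news-please version evaluates the setting.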