Closed MaxDall closed 1 month ago
The severe problem this PR addresses is that we use urlparse().netloc
to map URLs to publishers within CCNewsCrawling. .netloc
returns the entire domain with a subdomain, domain, and top-level domain. Let's make an example using APNews
.
Currently, the "domain" is set as https://www.apnews.com/
, resulting in www.apnews.com
as domain (subdomain www
+ second-level domain apnews
+ top-level domain com
). The correct domain and therefore the one CC-NEWS will crawl and use as target-uri
is https://apnews.com/
, with apnews.com
as the domain. This will result in a mismatch and prevent Fundus from parsing APNews
within using the CC-NEWS dataset.
There would be a solution using tldextract
and parsing only the second-level domain + top-level domain, but I think having the subdomain as well prevents Fundus from parsing unwanted stuff.
The severe problem this PR addresses is that we use
urlparse().netloc
to map URLs to publishers within CCNewsCrawling..netloc
returns the entire domain with a subdomain, domain, and top-level domain. Let's make an example usingAPNews
.Currently, the "domain" is set as
https://www.apnews.com/
, resulting inwww.apnews.com
as domain (subdomainwww
+ second-level domainapnews
+ top-level domaincom
). The correct domain and therefore the one CC-NEWS will crawl and use astarget-uri
ishttps://apnews.com/
, withapnews.com
as the domain. This will result in a mismatch and prevent Fundus from parsingAPNews
within using the CC-NEWS dataset.There would be a solution using
tldextract
and parsing only the second-level domain + top-level domain, but I think having the subdomain as well prevents Fundus from parsing unwanted stuff.
Oh ok, so the main reason for this change is to have the proper identifier when crawling CC-News? I wasn't aware of that, and was a bit confused about the way you chose to make those changes. I am wondering though, if there is another solution, because the way I understand it, this would require a check for every added publisher to see if the domain used in the PublisherSpec corresponds to the one used within CCNews. And if so, we should also update the documentation accordingly.
Or maybe asked differently, if I were to add a new publisher, how would I know how what to use as the domain? Because from an outside perspective in a browser, everything would work just fine.
It's not about the one used in CC-NEWS, it's the other way around. CC-NEWS uses the correct one, the domains set within our specification aren't correct. Even if https://www.apnews.com/
gets resolved to the correct site, the actual site's domain is https://apnews.com/
. You can also look it up in your browser. Just click on the URL.
this would require a check for every added publisher
That's what I did, but not if it's the one used in CC-NEWS, but if it's the correct one. Maybe i write a little test for this.
I think I understood the issue now. I think a test might be a good idea. Because even though you get rerouted to https://apnews.com, if you navigated there using www.apnews.com, and you copy the link, at least in my browser will still give you https://www.apnews.com. But I think that might be something for another PR
Hmm. I'm not getting rerouted as well. That makes everything a bit more complicated. Or it doesn't and we should match without subdomain.
Update: Originally I found out about this problem because the domains I got from articles returned by Fundus do not match the publisher's domain specified in the specs. I have to think about this again. Thanks for your comments :)
Including:
TAZ
NTV
APNews
TheIntercept
FreeBeacon