flairNLP / fundus

A very simple news crawler with a funny name
MIT License
127 stars 63 forks source link

Fix domains for several publishers #398

Closed MaxDall closed 1 month ago

MaxDall commented 2 months ago

Including:

MaxDall commented 1 month ago

The severe problem this PR addresses is that we use urlparse().netloc to map URLs to publishers within CCNewsCrawling. .netloc returns the entire domain with a subdomain, domain, and top-level domain. Let's make an example using APNews.

Currently, the "domain" is set as https://www.apnews.com/, resulting in www.apnews.com as domain (subdomain www + second-level domain apnews + top-level domain com). The correct domain and therefore the one CC-NEWS will crawl and use as target-uri is https://apnews.com/, with apnews.com as the domain. This will result in a mismatch and prevent Fundus from parsing APNews within using the CC-NEWS dataset.

There would be a solution using tldextract and parsing only the second-level domain + top-level domain, but I think having the subdomain as well prevents Fundus from parsing unwanted stuff.

addie9800 commented 1 month ago

The severe problem this PR addresses is that we use urlparse().netloc to map URLs to publishers within CCNewsCrawling. .netloc returns the entire domain with a subdomain, domain, and top-level domain. Let's make an example using APNews.

Currently, the "domain" is set as https://www.apnews.com/, resulting in www.apnews.com as domain (subdomain www + second-level domain apnews + top-level domain com). The correct domain and therefore the one CC-NEWS will crawl and use as target-uri is https://apnews.com/, with apnews.com as the domain. This will result in a mismatch and prevent Fundus from parsing APNews within using the CC-NEWS dataset.

There would be a solution using tldextract and parsing only the second-level domain + top-level domain, but I think having the subdomain as well prevents Fundus from parsing unwanted stuff.

Oh ok, so the main reason for this change is to have the proper identifier when crawling CC-News? I wasn't aware of that, and was a bit confused about the way you chose to make those changes. I am wondering though, if there is another solution, because the way I understand it, this would require a check for every added publisher to see if the domain used in the PublisherSpec corresponds to the one used within CCNews. And if so, we should also update the documentation accordingly.

Or maybe asked differently, if I were to add a new publisher, how would I know how what to use as the domain? Because from an outside perspective in a browser, everything would work just fine.

MaxDall commented 1 month ago

It's not about the one used in CC-NEWS, it's the other way around. CC-NEWS uses the correct one, the domains set within our specification aren't correct. Even if https://www.apnews.com/ gets resolved to the correct site, the actual site's domain is https://apnews.com/. You can also look it up in your browser. Just click on the URL.

this would require a check for every added publisher

That's what I did, but not if it's the one used in CC-NEWS, but if it's the correct one. Maybe i write a little test for this.

addie9800 commented 1 month ago

I think I understood the issue now. I think a test might be a good idea. Because even though you get rerouted to https://apnews.com, if you navigated there using www.apnews.com, and you copy the link, at least in my browser will still give you https://www.apnews.com. But I think that might be something for another PR

MaxDall commented 1 month ago

Hmm. I'm not getting rerouted as well. That makes everything a bit more complicated. Or it doesn't and we should match without subdomain.

Update: Originally I found out about this problem because the domains I got from articles returned by Fundus do not match the publisher's domain specified in the specs. I have to think about this again. Thanks for your comments :)