adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

ignore undateable domains more intentionally #34

Open rahulbot opened 3 years ago

rahulbot commented 3 years ago

In our testing, the current code produces unreliable results on Wikipedia articles: sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud, this produces more useful results (for our open web news analysis context).

In terms of implementation, we could just copy the filter_url_for_undateable function from date_guesser and use it as is, which would also bring in the other checks it does for undateable domains. We'd call it early on in guess_date.
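To make the idea concrete, here is a minimal sketch of a domain check of this kind. It is not the date_guesser code: the name is_undateable_url and the UNDATEABLE_DOMAINS set are hypothetical, and only the wikipedia.org case from this issue is covered.

```python
from urllib.parse import urlparse

# Hypothetical list for illustration only; date_guesser's own checks
# are broader and are not reproduced here.
UNDATEABLE_DOMAINS = {"wikipedia.org"}


def is_undateable_url(url: str) -> bool:
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in UNDATEABLE_DOMAINS
    )
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives for pages that merely mention wikipedia.org in their path or query.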

adbar commented 3 years ago

Hi @rahulbot, it would be OK but I'd prefer to get the chance to tackle the problem first. There is certainly a field in the HTML from which the date can be extracted. Would you mind giving examples of pages where the result wasn't as expected?

rahulbot commented 3 years ago

@coreydockser can you please provide an example of a wikipedia page that does return a publication date, and one that does not?

coreydockser commented 3 years ago

Sorry for the delay, I ran into some odd issues of my own making. Anyways, here's a sample of four articles with different results.

https://en.wikipedia.org/wiki/Among_Us – returns None (this is the behavior we want)

https://en.wikipedia.org/wiki/January_1969 – returns 2018-06-19, this date appears as datePublished in the html

https://en.wikipedia.org/wiki/F-scale_(personality_test) - returns 2005-07-05. the datePublished on this page is 2005-07-25, though, so I'm unsure where it came from.

https://en.wikipedia.org/wiki/2021_United_States_Capitol_attack - 2021-01-06, this is the date of the event, but it's also the datePublished.

adbar commented 3 years ago

@coreydockser Thanks, I'll look at it and see if I can find a solution.

adbar commented 3 years ago

Hi @coreydockser, I checked the cases and I don't agree with you at all:

So I fail to grasp where the problem lies, could you please be more specific and/or provide further examples for other websites?

rahulbot commented 2 years ago

The library version issue could explain some of those specific results. However the second piece is more of a question of your intentions. In our projects, "publication date" means the date a news article was listed as being published online. That is rooted in ideas from the historical news industry (despite edits and iterations of online stories becoming more commonplace). Wikipedia articles are meant to be living documents, so for us they don't have a "publication date" in that sense. This is important for our time-series based analysis of news attention.

So I guess one way to state the question is this: for this library, do you intend "publication date" to have a technology-informed definition, such as the date of last edit? Or do you want a more "news-ish" definition like the one we use?

It sounds like it is more the former, in which case there are no "undateable" domains. If that is what you intend, then we can close this issue as won't-fix and we can handle the idea of "undateable" domains based on our project definition in our own code before we pass content into htmldate.
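For reference, handling this on our side could look roughly like the sketch below. The names news_publication_date and SKIP_DOMAINS are hypothetical and project-specific, and the extractor is passed in as a callable so the wrapper does not depend on any particular library.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

# Project-specific assumption: domains we treat as having no publication date.
SKIP_DOMAINS = ("wikipedia.org",)


def news_publication_date(
    url: str,
    html: str,
    extract: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return None for skipped domains, otherwise whatever the extractor finds."""
    host = (urlparse(url).hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in SKIP_DOMAINS):
        return None  # living document: no "news-ish" publication date
    return extract(html)
```

In our pipeline the extract argument would be htmldate.find_date, called on the downloaded HTML string.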

Thanks for any clarifications and your great work on this library!

adbar commented 2 years ago

Thanks for the explanations, I get your point. Indeed, htmldate mostly provides a technology-informed concept of dating. It hopefully intersects with the news-ish definition in most cases, although the two may differ.

I guess it would be possible to focus on a "news-ish" understanding of publication date by setting an additional parameter prior to the extraction. What would be the formal requirements for it to happen?

I'm leaving this thread open to see if we can address the issue.