adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

Possibly wrong Mediacloud test data? #154

Open RadhiFadlillah opened 3 months ago

RadhiFadlillah commented 3 months ago

Hi @adbar, thanks for this awesome library.

While porting this library to Go, I noticed there are two Mediacloud tests that might be wrong:

"https://www.baltimoresun.com/opinion/columnists/zurawik/bs-ed-zontv-media-year-20201223-cnvrlhkhnrbihcxx6wxcxt2b7y-story.html#ed=rss_www.baltimoresun.com/arcio/rss/category/latest/": {
    "file": "1805697156.html",
    "date": "2020-12-23"
},
"https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/": {
    "file": "1806793639.html",
    "date": "2020-12-25"
},

For baltimoresun, the JSON-LD contains the following snippet:

{
    // ... omitted
    "articleSection": "zurawik",
    "dateCreated": "2020-12-22T01:06:41.361Z",
    "datePublished": "2020-12-23T15:42:33.814Z",
    "dateModified": "2020-12-23T15:42:34.197Z",
    // ... omitted
}

From that snippet, we can see that the creation date is 2020-12-22. Since we want the original date, I think we should use that one instead of 2020-12-23?
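To make the comparison concrete, here is a minimal Python sketch that picks the earliest of the three JSON-LD date fields from the snippet above. The helpers `parse_ts` and `earliest_date` are hypothetical names used only for illustration; this is not how htmldate itself selects a date, just one way to express "use the original date".

```python
import json
from datetime import datetime

# The date fields from the baltimoresun JSON-LD, with the omitted parts left out.
snippet = """{
    "articleSection": "zurawik",
    "dateCreated": "2020-12-22T01:06:41.361Z",
    "datePublished": "2020-12-23T15:42:33.814Z",
    "dateModified": "2020-12-23T15:42:34.197Z"
}"""


def parse_ts(value):
    # Hypothetical helper: older Python versions do not accept a trailing "Z"
    # in fromisoformat(), so it is replaced with an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def earliest_date(data, keys=("dateCreated", "datePublished", "dateModified")):
    # Hypothetical helper: return the earliest available date as YYYY-MM-DD.
    candidates = [parse_ts(data[key]) for key in keys if key in data]
    return min(candidates).date().isoformat() if candidates else None


original = earliest_date(json.loads(snippet))
print(original)  # 2020-12-22
```

With this approach the result is 2020-12-22, whereas the current test data expects 2020-12-23, which matches `datePublished`.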


For elbaladtv.net, the JSON-LD contains the following snippet:

{
    "@type": "WebPage",
    "@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#webpage",
    "url": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/",
    "name": "\u062a\u0631\u0643\u0649 \u0622\u0644 \u0627\u0644\u0634\u064a\u062e \u0628\u0639\u062f \u0625\u0635\u0627\u0628\u0629 \u064a\u0633\u0631\u0627 \u0628\u0643\u0648\u0631\u0648\u0646\u0627: \u064a\u0627\u0631\u0628 \u064a\u0631\u0641\u0639 \u0639\u0646\u0643 - \u0642\u0646\u0627\u0629 \u0635\u062f\u0649 \u0627\u0644\u0628\u0644\u062f",
    "datePublished": "2020-12-25T01:59:50+02:00",
    "dateModified": "2020-12-25T01:59:50+02:00",
    "isPartOf": { "@id": "https://elbaladtv.net/#website" },
    "primaryImageOfPage": {
        "@id": "https://elbaladtv.net/%d8%aa%d8%b1%d9%83%d9%89-%d8%a2%d9%84-%d8%a7%d9%84%d8%b4%d9%8a%d8%ae-%d8%a8%d8%b9%d8%af-%d8%a5%d8%b5%d8%a7%d8%a8%d8%a9-%d9%8a%d8%b3%d8%b1%d8%a7-%d8%a8%d9%83%d9%88%d8%b1%d9%88%d9%86%d8%a7-%d9%8a%d8%a7/#primaryImage"
    },
    "inLanguage": "ar"
}

It also contains the following meta tag:

<meta property="article:published_time" content="2020-12-24T23:59:50+00:00">

From those two, we can see that the published times in the JSON-LD and in the meta tag are actually the same instant, except that the former is expressed in UTC+2 while the latter is in UTC+0.

So, for the extraction result, I think we should use 2020-12-24, since it is based on UTC time instead of local time.
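A quick way to check this is with Python's standard `datetime` module. The sketch below only uses the two timestamps quoted above; the variable names are mine, and it says nothing about how htmldate handles time zones internally.

```python
from datetime import datetime, timezone

# Timestamps copied from the JSON-LD and the meta tag above.
json_ld = datetime.fromisoformat("2020-12-25T01:59:50+02:00")
meta = datetime.fromisoformat("2020-12-24T23:59:50+00:00")

# Aware datetimes compare by instant, regardless of their UTC offset.
same_instant = json_ld == meta
print(same_instant)  # True

# The calendar date depends on which offset is used to read the instant.
local_date = json_ld.date().isoformat()
utc_date = json_ld.astimezone(timezone.utc).date().isoformat()
print(local_date)  # 2020-12-25
print(utc_date)    # 2020-12-24
```

So both sources describe the same moment, and the expected date in the test only changes depending on whether the local or the UTC calendar day is taken.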

adbar commented 3 months ago

Hi @RadhiFadlillah, thanks for your feedback, I'll have a look.