adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

how are timezones handled when available? #32

Closed rahulbot closed 3 years ago

rahulbot commented 3 years ago

Some articles include the full publication time, with timezone, in HTML meta tags or JavaScript config. Does this library parse and handle those timezones? Relatedly, how does it internally store dates with regard to timezone: are they all returned in machine-local time, held in GMT, or something else?

For instance, this Guardian article includes the article:published_time meta tag with a timezone included. Does this library recognize that timezone and return the date as it would be in GMT? Same for this article on CNN, which includes the datePublished meta tag.

adbar commented 3 years ago

Hi @rahulbot, since I was mostly interested in day-level granularity, I haven't implemented time zone identification so far. However, the underlying libraries python-dateutil and dateparser, as well as the optional ciso8601, all deal with it IMHO.
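
For illustration only, here is a rough sketch of what timezone-aware handling of an article:published_time meta tag could look like. It uses only the Python standard library and is not how htmldate works internally; the sample HTML, the timestamp value, and the names PublishedTimeParser and published_time_utc are all made up for this example.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Hypothetical sample page; the timestamp value is invented for illustration.
SAMPLE_HTML = (
    '<html><head>'
    '<meta property="article:published_time" content="2021-10-18T15:30:00+03:30">'
    '</head><body></body></html>'
)


class PublishedTimeParser(HTMLParser):
    """Collects the content of the first article:published_time meta tag."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            self.published is None
            and tag == "meta"
            and attributes.get("property") == "article:published_time"
        ):
            self.published = attributes.get("content")


def published_time_utc(html):
    """Return the published time converted to UTC, or None if unavailable."""
    parser = PublishedTimeParser()
    parser.feed(html)
    if parser.published is None:
        return None
    local = datetime.fromisoformat(parser.published)
    if local.tzinfo is None:
        # No offset in the timestamp, so it cannot be converted reliably.
        return None
    return local.astimezone(timezone.utc)


result = published_time_utc(SAMPLE_HTML)
print(result)
```

The offset in the tag is kept by the parser as part of the datetime, so converting to UTC (GMT) is a single astimezone call. Timestamps without an offset are rejected here rather than guessed at.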

rahulbot commented 3 years ago

Good to know, thanks. In the longer term, if we do switch to htmldate for use in Media Cloud, we might explore integrating time parsing (at least for the machine-readable timestamps in metadata). In that case we'd probably add timezone parsing as well.

moehmeni commented 2 years ago

You can use %Y-%m-%dT%H:%M:%S%z as the outputformat argument. Output: 2021-10-18T15:30:00+0330


With that output and something like the parse method of the python-dateutil package, you can get this pattern: 2021-10-18 15:30:00+03:30
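
A minimal sketch of that last step, assuming the string above is what you got back. It swaps in the standard library's datetime.strptime for python-dateutil so it runs without extra packages; the variable names are just for illustration.

```python
from datetime import datetime, timezone

# The string reported above as the output for the %z-style format.
stamp = "2021-10-18T15:30:00+0330"

# %z accepts a numeric UTC offset such as +0330 and yields an aware datetime.
parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
print(parsed)  # 2021-10-18 15:30:00+03:30

# Once the offset is attached, normalizing to UTC is one call.
as_utc = parsed.astimezone(timezone.utc)
print(as_utc)  # 2021-10-18 12:00:00+00:00
```

Printing the aware datetime gives the pattern with the colon in the offset, and formatting it again with the same format string returns the original text.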