alan-turing-institute / ReadabiliPy

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.
MIT License
230 stars 36 forks source link

Date extraction #58

Closed edwardchalstrey1 closed 5 years ago

edwardchalstrey1 commented 5 years ago

Extract dates in similar fashion to title extraction, with functions copied from misinformation as discussed. We can edit misinformation to require these functions from Readabilipy once this is approved.

edwardchalstrey1 commented 5 years ago

Do we need to check every possible date format for every element that we extract? I'm worried about false positives, but maybe that's OK if we use the previous dictionary of date => score setup as for titles?

Not checking every date format for every element extracted. I'm only checking every date format for the single date string that extract_element() has decided is the correct one from the html.

edwardchalstrey1 commented 5 years ago

NOTE: need to edit extractors.py in misinformation crawler to get the date from default parse_to_json when not config

jemrobinson commented 5 years ago

If possible, test against misinformation-crawler before we merge, as some tweaks to eg. scores or xpaths might be needed.

edwardchalstrey1 commented 5 years ago

If possible, test against misinformation-crawler before we merge, as some tweaks to eg. scores or xpaths might be needed.

@jemrobinson This has now been done but required a bit of additional refactoring and testing

jemrobinson commented 5 years ago

If we're just parsing dates and dropping timezones, do we actually need anything more than:

from dateutil import parser
dt = parser.parse(<my string>)

It's pretty good at guessing formats. Do you want to have a go with this @edwardchalstrey1 ?

edwardchalstrey1 commented 5 years ago

If we're just parsing dates and dropping timezones, do we actually need anything more than:

from dateutil import parser
dt = parser.parse(<my string>)

It's pretty good at guessing formats. Do you want to have a go with this @edwardchalstrey1 ?

@jemrobinson I was able to use dateutil parse in almost every case, but use pendulum/arrow for unusual formats when dateutil couldn't convert the date string - tested with misinformation-crawler

martintoreilly commented 5 years ago

@edwardchalstrey1 @jemrobinson I'm pretty sure I tried dateutil's parser when I first did this and settled on pendulum with a fallback to arrow. I can't remember why or find any notes I made. It may be as innocuous as not finding anything it could do that pendulum or arrow couldn't, but my concern would be dateutil.parser returning a less correct date than one of the other two. Can we get a test set of date strings and human-generated true "datetimes with timezoeninfo" for all the dates in all the test pages in the misinformation crawler and see how the three libraries do?

martintoreilly commented 5 years ago

Note that our most general date field can't include a timezone as many articles don't include enough information. However, we should store a fully time zone qualified datetime where this information is present.

martintoreilly commented 5 years ago

Also, see this interesting article by Jon Skeet on the risk of converting a local time into UTC - https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/

martintoreilly commented 5 years ago

We should think carefully about what date representation we want to store in the main date field. My feeling would be to store local time here but the most accurate time representation in the additional metadata JSON field. We should also make sure to flag the limitation on ordering / precedence of articles using the main date field, as without timezone information you can't always even be sure that one article was published ahead of another when its date is the day before (or year before in the case where the second date is 01 Jan).

martintoreilly commented 5 years ago

@jemrobinson I think we already looked into what information we could get from the server about its timezone, but I think (i) this was not guaranteed to be present and (ii) might not reflect the timezone of the person writing the date when it was not generated by the server.

martintoreilly commented 5 years ago

Also see the following packages:

edwardchalstrey1 commented 5 years ago

@edwardchalstrey1 @jemrobinson I'm pretty sure I tried dateutil's parser when I first did this and settled on pendulum with a fallback to arrow. I can't remember why or find any notes I made. It may be as innocuous as not finding anything it could do that pendulum or arrow couldn't, but my concern would be dateutil.parser returning a less correct date than one of the other two. Can we get a test set of date strings and human-generated true "datetimes with timezoeninfo" for all the dates in all the test pages in the misinformation crawler and see how the three libraries do?

@martintoreilly see the date tests I've set up here: https://github.com/alan-turing-institute/ReadabiliPy/pull/58/files#diff-f0f3f8206469e65c1ed152c35699ce41

I now have these tests and all crawler tests passing without using pendulum/arrow at all.

Where I have been editing the test dates in my branch of the crawler, this has only been because of using different xpaths in ReadabiliPy vs the site config file, not because of the different datetime function: https://github.com/alan-turing-institute/misinformation-crawler/pull/187/files

edwardchalstrey1 commented 5 years ago

Also see the following packages:

Newspaper looks very promising!

jemrobinson commented 5 years ago

@martintoreilly : I think this PR is ready to merge apart from your issue about the date. Can you confirm what you'd like to be doing here. For reference, what the code in misinformation-crawler currently does is to interpret all datetime strings as being in a single timezone (implicitly this is UTC), by dropping any timezone information.

Would you like ReadabiliPy to reproduce this behaviour? Or perhaps we should convert to UTC explicitly (ie. taking timezone information into account if it's available). I'm not quite sure what you meant by "local time" in your comment above.

martintoreilly commented 5 years ago

@jemrobinson I think we can merge this if it replicates the behaviour of the existing misinformation-crawler code.

When I say "local time", I mean the datetime part of the string, removing the timezone. This is what I think the code in misinformation-crawler does. If so, then although the "timezoneless" datetime may pick up an explicit UTC timezone due to the nature of the arrow/pendulum datetime objects, this is not retained when stored in the database as a datetime2 field.

I think the approach of taking just the date + time component of a datetime string unchanged as "local time" is the best way to ensure that we have a datetime field we can consistently interpret, as when we source this from strings with no timezone info we cannot convert them to a specific timezone such as UTC.

edwardchalstrey1 commented 5 years ago

@jemrobinson I think we can merge this if it replicates the behaviour of the existing misinformation-crawler code.

When I say "local time", I mean the datetime part of the string, removing the timezone. This is what I think the code in misinformation-crawler does. If so, then although the "timezoneless" datetime may pick up an explicit UTC timezone due to the nature of the arrow/pendulum datetime objects, this is not retained when stored in the database as a datetime2 field.

I think the approach of taking just the date + time component of a datetime string unchanged as "local time" is the best way to ensure that we have a datetime field we can consistently interpret, as when we source this from strings with no timezone info we cannot convert them to a specific timezone such as UTC.

Looks like we're good here then, I set dateutil.parser to ignore timezone and already have a test that confirms this: <meta property="article:published_time" content="2019-01-30 09:39:19 -0500" />

comes out as:

2019-01-30T09:39:19