Closed: edwardchalstrey1 closed this 5 years ago
Do we need to check every possible date format for every element that we extract? I'm worried about false positives, but maybe that's OK if we use the previous date => score dictionary setup, as for titles?
I'm not checking every date format for every element extracted; I'm only checking every date format for the single date string that extract_element() has decided is the correct one from the HTML.
NOTE: we need to edit extractors.py in the misinformation crawler to get the date from the default parse_to_json when no config is supplied.
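For illustration, a minimal sketch of checking every known format against that single extracted string (the format list and function name here are hypothetical, not the actual ReadabiliPy code):

```python
from datetime import datetime

# Illustrative subset of formats; the real extractor can recognise many more.
KNOWN_DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d %B %Y",
    "%B %d, %Y",
]


def parse_known_formats(date_string):
    """Try each known format in turn against the single candidate string."""
    for date_format in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(date_string, date_format)
        except ValueError:
            continue
    return None
```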
If possible, test against misinformation-crawler before we merge, as some tweaks to e.g. scores or xpaths might be needed.
@jemrobinson This has now been done but required a bit of additional refactoring and testing
If we're just parsing dates and dropping timezones, do we actually need anything more than:

```python
from dateutil import parser

dt = parser.parse(my_string)
```

It's pretty good at guessing formats. Do you want to have a go with this @edwardchalstrey1?
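For example, dateutil's parser copes with a range of formats without being given an explicit format string (the sample strings below are purely illustrative):

```python
from dateutil import parser

# dateutil guesses the format for each string; no format string is supplied.
for s in ["2019-01-30 09:39:19 -0500", "30 January 2019", "January 30, 2019 09:39"]:
    print(parser.parse(s))
# The first keeps its -05:00 offset; the other two come back as naive datetimes.
```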
@jemrobinson I was able to use dateutil's parse in almost every case, but fell back to pendulum/arrow for unusual formats when dateutil couldn't convert the date string - tested with misinformation-crawler.
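A minimal sketch of that fallback order, assuming the chain is simply dateutil first, then pendulum, then arrow (the function name is hypothetical):

```python
import arrow
import pendulum
from dateutil import parser as dateutil_parser


def parse_date_string(date_string):
    """Try dateutil first; fall back to pendulum, then arrow, for unusual formats."""
    try:
        return dateutil_parser.parse(date_string)
    except (ValueError, OverflowError):
        pass
    try:
        return pendulum.parse(date_string, strict=False)
    except Exception:
        pass
    try:
        return arrow.get(date_string).datetime
    except Exception:
        return None
```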
@edwardchalstrey1 @jemrobinson I'm pretty sure I tried dateutil's parser when I first did this and settled on pendulum with a fallback to arrow. I can't remember why or find any notes I made. It may be as innocuous as not finding anything it could do that pendulum or arrow couldn't, but my concern would be dateutil.parser returning a less correct date than one of the other two. Can we get a test set of date strings and human-generated true "datetimes with timezone info" for all the dates in all the test pages in the misinformation crawler and see how the three libraries do?
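One possible shape for that comparison, assuming a hand-built list of raw strings paired with human-verified ISO 8601 datetimes (the single case below is only a placeholder):

```python
import arrow
import pendulum
from dateutil import parser as dateutil_parser

# Hypothetical test data: raw date strings with human-verified timezone-aware datetimes.
TEST_CASES = [
    ("2019-01-30 09:39:19 -0500", "2019-01-30T09:39:19-05:00"),
]

PARSERS = {
    "dateutil": lambda s: dateutil_parser.parse(s),
    "pendulum": lambda s: pendulum.parse(s, strict=False),
    "arrow": lambda s: arrow.get(s),
}

for raw, expected in TEST_CASES:
    for name, parse in PARSERS.items():
        try:
            result = parse(raw).isoformat()
        except Exception as err:  # record failures rather than stopping the comparison
            result = "failed: {}".format(err)
        print(name, "OK" if result == expected else "differs ({})".format(result), raw)
```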
Note that our most general date field can't include a timezone as many articles don't include enough information. However, we should store a fully time zone qualified datetime where this information is present.
Also, see this interesting article by Jon Skeet on the risk of converting a local time into UTC - https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/
We should think carefully about what date representation we want to store in the main date field. My feeling would be to store local time here, but the most accurate time representation in the additional metadata JSON field. We should also make sure to flag the limitation on ordering / precedence of articles using the main date field: without timezone information you can't always be sure that one article was published ahead of another, even when its date is the day before (or the year before, in the case where the second date is 01 Jan).
@jemrobinson I think we already looked into what information we could get from the server about its timezone, but I think (i) this was not guaranteed to be present and (ii) might not reflect the timezone of the person writing the date when it was not generated by the server.
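A minimal sketch of the split proposed above, assuming we keep the wall-clock values for the main date field and only record a fully qualified datetime in the metadata when the source string actually carried an offset (the function name is hypothetical):

```python
from dateutil import parser


def split_datetime(date_string):
    """Return (naive local datetime for the main field, tz-aware ISO string or None for metadata)."""
    parsed = parser.parse(date_string)
    local_naive = parsed.replace(tzinfo=None)  # the date + time exactly as written
    qualified = parsed.isoformat() if parsed.tzinfo is not None else None
    return local_naive, qualified
```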
@martintoreilly see the date tests I've set up here: https://github.com/alan-turing-institute/ReadabiliPy/pull/58/files#diff-f0f3f8206469e65c1ed152c35699ce41
I now have these tests and all crawler tests passing without using pendulum/arrow at all.
Where I have edited the test dates in my branch of the crawler, this was only because ReadabiliPy uses different xpaths from the site config file, not because of the different datetime function: https://github.com/alan-turing-institute/misinformation-crawler/pull/187/files
@martintoreilly : I think this PR is ready to merge apart from your issue about the date. Can you confirm what you'd like us to be doing here? For reference, what the code in misinformation-crawler currently does is to interpret all datetime strings as being in a single timezone (implicitly UTC) by dropping any timezone information.
Would you like ReadabiliPy to reproduce this behaviour? Or should we convert to UTC explicitly (i.e. taking timezone information into account when it's available)? I'm not quite sure what you meant by "local time" in your comment above.
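For reference, the two options give different values whenever an offset is present; a quick sketch using the example string that appears later in this thread:

```python
from datetime import timezone
from dateutil import parser

s = "2019-01-30 09:39:19 -0500"

# Option 1: drop the timezone and keep the wall-clock time (the current crawler behaviour).
dropped = parser.parse(s, ignoretz=True)           # datetime(2019, 1, 30, 9, 39, 19)

# Option 2: convert to UTC explicitly, using the offset when it's available.
as_utc = parser.parse(s).astimezone(timezone.utc)  # 2019-01-30 14:39:19+00:00
```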
@jemrobinson I think we can merge this if it replicates the behaviour of the existing misinformation-crawler code.
When I say "local time", I mean the datetime part of the string, removing the timezone. This is what I think the code in misinformation-crawler does. If so, then although the "timezoneless" datetime may pick up an explicit UTC timezone due to the nature of the arrow/pendulum datetime objects, this is not retained when stored in the database as a datetime2 field.
I think the approach of taking just the date + time component of a datetime string unchanged as "local time" is the best way to ensure that we have a datetime field we can consistently interpret, as when we source this from strings with no timezone info we cannot convert them to a specific timezone such as UTC.
Looks like we're good here then. I set dateutil.parser to ignore the timezone and already have a test that confirms this:
<meta property="article:published_time" content="2019-01-30 09:39:19 -0500" />
comes out as:
2019-01-30T09:39:19
Extract dates in a similar fashion to title extraction, with functions copied from misinformation as discussed. We can edit misinformation to require these functions from ReadabiliPy once this is approved.
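A rough sketch of that scored-candidate approach, mirroring what is done for titles (the xpaths, scores, and function name below are purely illustrative, not the actual configuration):

```python
from lxml import html

# Illustrative candidate xpaths with scores; the real extractor defines its own list.
DATE_XPATHS = {
    '//meta[@property="article:published_time"]/@content': 10,
    "//time/@datetime": 8,
    '//span[contains(@class, "date")]//text()': 4,
}


def extract_date_candidates(html_string):
    """Collect candidate date strings and score them, as done for titles."""
    tree = html.fromstring(html_string)
    scores = {}
    for xpath, score in DATE_XPATHS.items():
        for value in tree.xpath(xpath):
            text = str(value).strip()
            if text:
                scores[text] = scores.get(text, 0) + score
    return scores  # the highest-scoring string is taken as the extracted date
```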