Closed by edsu 1 year ago
I see in `_utils.parse_timestamp` there is already some logic to guard against a day of `00`. But I'm confused by the logic.
In the test it looks like the timestamp `20000800241623` has the day `00` removed, leaving `200008241623`, and then `00` is appended on the end, leaving `20000824162300`. This means the date `2000-08-00 24:16:23` is corrected to `2000-08-24 16:23:00`?
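To make sure I'm reading the test right, here is a rough sketch of the transformation as I understand it. This is not the actual `_utils.parse_timestamp` code; the helper name `drop_zero_day` is hypothetical, and it assumes a 14-digit `YYYYMMDDhhmmss` timestamp.

```python
from datetime import datetime


def drop_zero_day(timestamp: str) -> str:
    """Hypothetical helper: remove a '00' day and pad the end with '00'.

    Assumes a 14-digit YYYYMMDDhhmmss timestamp; anything else is
    returned unchanged.
    """
    if len(timestamp) == 14 and timestamp[6:8] == '00':
        # Cut out the two day digits, then append '00' to get back to 14 digits.
        return timestamp[:6] + timestamp[8:] + '00'
    return timestamp


fixed = drop_zero_day('20000800241623')
print(fixed)  # 20000824162300
print(datetime.strptime(fixed, '%Y%m%d%H%M%S'))
```

With this reading, the digits that looked like the hour (`24`) end up as the day, and everything after them shifts left by one field.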
Did someone from the Internet Archive indicate that this was a good way to handle the situation? Absent any other information I would have been inclined to rewrite the `00` as `01`, and also log a warning. But in this case that would have left an hour of `24`, which is invalid. Was the `20000800241623` timestamp actually found in the wild, or was it fabricated for the test?
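A quick check of why the `01` rewrite would not be enough on its own, using only the standard library. The format string is my assumption about how a 14-digit timestamp would be parsed, not necessarily what the library does, and `parses` is a hypothetical helper.

```python
from datetime import datetime

FORMAT = '%Y%m%d%H%M%S'  # assumed layout: YYYYMMDDhhmmss


def parses(timestamp: str) -> bool:
    """Return True if the timestamp is a valid date/time in FORMAT."""
    try:
        datetime.strptime(timestamp, FORMAT)
        return True
    except ValueError:
        return False


original = '20000800241623'
# Rewrite the day digits from '00' to '01', leaving everything else alone.
day_as_01 = original[:6] + '01' + original[8:]

print(parses(original))   # False: day 00 is invalid
print(parses(day_as_01))  # False: hour 24 is still invalid
```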
We talked on Slack, but summarizing here for transparency…
> Did someone from the Internet Archive indicate that this was a good way to handle the situation? Absent any other information I would have been inclined to rewrite the `00` as `01`
Yep! Rolling the `24` over to a day of `01` and an hour of `00` was what we initially thought was right, but someone from the Archive found the actual archived content and it turned out that was not correct: somehow an extra `00` had just been stuck in the middle. More here: https://github.com/edgi-govdata-archiving/wayback/pull/85#discussion_r731437939
IIRC, nobody was sure whether it was good to generalize or not, but since a `00` day will always be invalid, this seemed as good a fix as any.
> Was the `20000800241623` timestamp actually found in the wild, or was it fabricated for the test?
In the wild! More details in #85.
I happened to be doing this:
and noticed that after running for 10 minutes or so it blew up with:
It looks like the CDX API returned a datetime `20000008241731`, which throws an exception during parse because `00` isn't a valid month?
I don't know what the solution is here: