edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

ValueError: time data #88

Closed edsu closed 1 year ago

edsu commented 1 year ago

I happened to be doing this:

from wayback import WaybackClient

ia = WaybackClient()
for result in ia.search('lapdonline.org', matchType='prefix'):
    print(result)

and noticed that after running for 10 minutes or so it blew up with:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/wayback/wayback/_client.py", line 543, in search
    capture_time = _utils.parse_timestamp(data.timestamp)
  File "/Users/edsummers/Projects/wayback/wayback/_utils.py", line 57, in parse_timestamp
    .strptime(''.join(timestamp_chars), URL_DATE_FORMAT)
  File "/usr/local/Cellar/python@3.10/3.10.6_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python@3.10/3.10.6_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '20000008241731' does not match format '%Y%m%d%H%M%S'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/edsummers/Projects/wayback/./x.py", line 6, in <module>
    for result in ia.search('lapdonline.org', matchType='prefix'):
  File "/Users/edsummers/Projects/wayback/wayback/_client.py", line 547, in search
    raise UnexpectedResponseFormat(
wayback.exceptions.UnexpectedResponseFormat: Could not parse CDX output: "org,lapdonline)/community/op_valley_bureau/north_hollywood/map/map.htm 20000008241731 http://www.lapdonline.org:80/community/op_valley_bureau/north_hollywood/map/map.htm text/html 200 2GPKQMU3BLZXOEZ5EWDQEYHPMKWEHNT3 1158" (query: {'url': 'lapdonline.org', 'matchType': 'prefix', 'showResumeKey': 'true', 'resolveRevisits': 'true'})

It looks like the CDX API returned the timestamp 20000008241731, which raises an exception during parsing because 00 isn't a valid month.
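The failure can be reproduced without calling the CDX API, using only the format string shown in the traceback. This is a minimal sketch: `try_parse` is a hypothetical helper for illustration and is not part of wayback.

```python
from datetime import datetime

# Format string as reported in the traceback above.
URL_DATE_FORMAT = '%Y%m%d%H%M%S'


def try_parse(timestamp):
    """Return a datetime, or None if the timestamp does not match the format."""
    try:
        return datetime.strptime(timestamp, URL_DATE_FORMAT)
    except ValueError:
        # strptime rejects a month of "00", so the CDX value lands here.
        return None


bad = try_parse('20000008241731')   # timestamp from the CDX response
good = try_parse('20000824173100')  # an arbitrary well-formed timestamp
```

Here `bad` is `None` and `good` is a regular `datetime`, which shows that the month field alone is enough to make the parse fail.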

I don't know what the solution is here.

edsu commented 1 year ago

I see in _utils.parse_timestamp there is already some logic to guard against a day of 00. But I'm confused by the logic.

In the test, it looks like the timestamp 20000800241623 has the day 00 removed, leaving 200008241623, and then 00 is appended to the end, leaving 20000824162300. Does this mean the date 2000-08-00 24:16:23 is corrected to 2000-08-24 16:23:00?
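The transformation described above can be sketched as follows. This assumes the day occupies characters 6 and 7 of the 14-digit timestamp; `drop_zero_day` is a made-up name for illustration, and this is not the actual code in _utils.parse_timestamp.

```python
from datetime import datetime

URL_DATE_FORMAT = '%Y%m%d%H%M%S'


def drop_zero_day(timestamp):
    """Remove a '00' day and pad the end with '00' (hypothetical helper)."""
    chars = list(timestamp)
    if chars[6:8] == ['0', '0']:
        # Drop the two day characters, then restore the length to 14 digits
        # by appending "00" as the seconds.
        del chars[6:8]
        chars.extend('00')
    return ''.join(chars)


repaired = drop_zero_day('20000800241623')
parsed = datetime.strptime(repaired, URL_DATE_FORMAT)
```

With the timestamp from the test, `repaired` is '20000824162300' and `parsed` is 2000-08-24 16:23:00.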

Did someone from the Internet Archive indicate that this was a good way to handle the situation? Absent any other information I would have been inclined to rewrite the 00 as 01, and also log a warning. But in this case that would have left an hour of 24, which is invalid. Was the 20000800241623 timestamp actually found in the wild, or was it fabricated for the test?
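For comparison, here is a rough sketch of the alternative floated above: rewrite a day of 00 as 01 and log a warning. `rewrite_zero_day` and `parses` are hypothetical helpers, not wayback functions. It illustrates why this approach does not help for this particular timestamp.

```python
import logging
from datetime import datetime

URL_DATE_FORMAT = '%Y%m%d%H%M%S'
logger = logging.getLogger(__name__)


def rewrite_zero_day(timestamp):
    """Replace a '00' day with '01' and log a warning (hypothetical helper)."""
    if timestamp[6:8] == '00':
        logger.warning('Invalid day "00" in timestamp %s; using "01"', timestamp)
        return timestamp[:6] + '01' + timestamp[8:]
    return timestamp


def parses(timestamp):
    """Return True if the timestamp matches URL_DATE_FORMAT."""
    try:
        datetime.strptime(timestamp, URL_DATE_FORMAT)
        return True
    except ValueError:
        return False


candidate = rewrite_zero_day('20000800241623')
still_invalid = not parses(candidate)
```

The rewritten value is '20000801241623', and `still_invalid` is `True` because the hour field would have to be 24.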

Mr0grog commented 1 year ago

We talked on Slack, but summarizing here for transparency…

> Did someone from the Internet Archive indicate that this was a good way to handle the situation? Absent any other information I would have been inclined to rewrite the 00 as 01

Yep! Rolling the 24 over to a day of 01 and an hour of 00 was what we initially thought was right, but someone from the Archive found the actual archived content and it turned out that was not correct — somehow an extra 00 had just been stuck in the middle. More here: https://github.com/edgi-govdata-archiving/wayback/pull/85#discussion_r731437939

IIRC, nobody was sure whether it was good to generalize or not, but since a 00 day will always be invalid, this seemed as good a fix as any.

> Was the 20000800241623 timestamp actually found in the wild, or was it fabricated for the test?

In the wild! More details in #85.