edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
61 stars 12 forks source link

Fix handling undefined length in CdxRecord #83

Closed 8W9aG closed 2 years ago

8W9aG commented 2 years ago
import wayback
client = wayback.WaybackClient()
records = client.search("www.cnn.com/*", matchType="domain", filter_field="statuscode:200")
len(list(records))
wayback.exceptions.UnexpectedResponseFormat: Could not parse CDX output: "com,cnn)/ 20100218025451 http://www.cnn.com/ text/html 200 JK7O4OWOMDJHBNQJ5U43X5OQRORF6FQB -" (query: {'url': 'www.cnn.com/*', 'matchType': 'domain', 'filter': 'statuscode:200', 'showResumeKey': 'true', 'resolveRevisits': 'true'})

The exception occurs on line 542 of _client.py. The problem is the service returns "-" where length should be, probably indicating that for some reason this data is missing, however the wayback client attempts to parse it to an int. In this case, use None instead.

Mr0grog commented 2 years ago

Wow, I’m surprised I haven’t run into this before! Could you please add a test? After that, I’ll get this merged and released ASAP.

Since this is covering a pretty narrow scenario, it’s probably better to use a mock (most of the tests use VCR). This one is a decent example: https://github.com/edgi-govdata-archiving/wayback/blob/024a80ec0eae44fcf64e3d834d5f43c78937b8fb/wayback/tests/test_client.py#L166-L190

But since this is such a simple case, you probably don’t need the response content in a separate file. You could just have a string right in the source code for the test.

8W9aG commented 2 years ago

@Mr0grog thanks for the feedback, added a test case

8W9aG commented 2 years ago

@Mr0grog That sounds great thanks!

Mr0grog commented 2 years ago

You can now grab this from PyPI as version 0.3.1! 🚀