cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Apache License 2.0

Filters and url of crawled page in python client #9

Closed ydennisy closed 4 years ago

ydennisy commented 4 years ago

Hi!

I am trying to figure out how to pass filters in the Python client, the same way the CLI allows. I am looking to filter by language, date and status code.

The second piece I am missing is when iterating over the results - how does one grab the URL of the crawl?

import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = 'dailymail.co.uk/*'

print(url, 'size estimate', cdx.get_size_estimate(url))

for obj in cdx.iter(url, limit=100):
    print(obj.text) # obj.url would be nice :)

EDIT:

Also the example on the readme does not print out the content of obj, but rather: <cdx_toolkit.CaptureObject object at 0x1125f84d0>

ydennisy commented 4 years ago

Ok another update!

I have found how to add filters and extract the URL, however I am unclear on the format I should pass for time filtering:

for obj in cdx.iter(url, filter='=status:200', from_ts='1597881600', limit=10):
    print(obj.data['url'])

Produces: cannot parse timestamp, is it a legal date?: 15978816000000

wumpus commented 4 years ago

A **timestamp** represents year-month-day-time as a string of digits run together.
Example: January 5, 2016 at 12:34:56 UTC is 20160105123456. These timestamps are
a field in the index, and are also used to specify the dates used
by **--from=**, **--to**, and **--closest** on the command-line. (Programmatically,
use **from_ts=**, **to=**, and **closest=**.)

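For anyone starting from a Unix time, as in the question above, here is a minimal sketch of the conversion using only the standard library. The helper name `unix_to_cdx_timestamp` is made up for illustration and is not part of cdx_toolkit.

```python
from datetime import datetime, timezone


def unix_to_cdx_timestamp(unix_seconds):
    """Format Unix epoch seconds as a 14-digit YYYYMMDDhhmmss string in UTC.

    Hypothetical helper for illustration; not part of cdx_toolkit.
    """
    dt = datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc)
    return dt.strftime('%Y%m%d%H%M%S')


# The value from the question, 1597881600, is midnight UTC on August 20, 2020.
print(unix_to_cdx_timestamp(1597881600))  # 20200820000000
```

The resulting string is the kind of value the description above expects, so it should be usable as `from_ts=` in the `cdx.iter(...)` call from the question.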
wumpus commented 4 years ago

p.s. it's no surprise this wasn't obvious: CDX has a lot of jargon that's different from the way the words are used elsewhere. I will think about adding a warning if the timestamp looks like a unixtime; it ought to be easy.

wumpus commented 4 years ago

I pushed a new version with a new warning in this case.
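
The thread does not show how that warning is implemented. Purely as an illustration of the idea, and not cdx_toolkit's actual code, a check for "looks like a unixtime" could be sketched as below. The function name and the 10-digit heuristic are assumptions.

```python
def looks_like_unix_time(ts):
    """Heuristic guess: is this a 10-digit Unix time rather than a CDX timestamp?

    Hypothetical sketch, not cdx_toolkit's implementation. A 10-digit CDX
    prefix would read as YYYYMMDDhh, so implausible month/day/hour fields
    suggest the digits are epoch seconds instead.
    """
    ts = str(ts)
    if not (ts.isdigit() and len(ts) == 10):
        return False
    month, day, hour = int(ts[4:6]), int(ts[6:8]), int(ts[8:10])
    plausible_cdx = 1 <= month <= 12 and 1 <= day <= 31 and hour <= 23
    return not plausible_cdx


for candidate in ('1597881600', '20160105123456', '2016010512'):
    print(candidate, looks_like_unix_time(candidate))
```

A heuristic like this can miss Unix times whose digits happen to form a plausible date, so it suits a warning better than a hard error.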