ydennisy closed 4 years ago
Ok another update!
I have found how to add filters and extract the URL; however, I am unclear on which format to pass for time filtering:
    for obj in cdx.iter(url, filter='=status:200', from_ts='1597881600', limit=10):
        o = obj
        print(obj.data['url'])
Produces: cannot parse timestamp, is it a legal date?: 15978816000000
A **timestamp** represents year-month-day-time as a string of digits run together.
Example: January 5, 2016 at 12:34:56 UTC is 20160105123456. These timestamps are
a field in the index, and are also used to specify the dates used
by **--from=**, **--to**, and **--closest** on the command-line. (Programmatically,
use **from_ts=**, to=, and closest=.)
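If the starting point is a Unix time such as the 1597881600 in the snippet above, something along these lines should turn it into the digit-string form described here. This is only a sketch using the Python standard library; the helper name `unix_to_cdx_timestamp` is made up for illustration and is not part of cdx_toolkit.

```python
from datetime import datetime, timezone


def unix_to_cdx_timestamp(unix_seconds):
    """Convert Unix seconds (int, float, or digit string) to a 14-digit
    YYYYMMDDhhmmss string in UTC. Hypothetical helper, not a cdx_toolkit API."""
    dt = datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc)
    return dt.strftime('%Y%m%d%H%M%S')


# The value from the failing call, converted to the timestamp format.
from_ts = unix_to_cdx_timestamp('1597881600')
print(from_ts)  # 20200820000000
```

The resulting string can then be passed as from_ts= in place of the raw Unix time.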
P.S. It's no surprise this wasn't obvious: CDX has a lot of jargon that differs from the way those words are used elsewhere. I will think about adding a warning if the timestamp looks like a Unix time; it ought to be easy.
I pushed a new version with a new warning in this case.
Hi!
I am trying to figure out how to pass filters in the same way as is possible in the CLI. I am looking to filter by language, date and status code.
The second piece I am missing is when iterating over the results - how does one grab the URL of the crawl?
EDIT:
Also, the example in the README does not print out the content of `obj`, but rather: `<cdx_toolkit.CaptureObject object at 0x1125f84d0>`