cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Apache License 2.0

new Common Crawl year indices #17

wumpus closed this issue 3 years ago

wumpus commented 3 years ago

The cdx_toolkit code gets Common Crawl index dates by parsing the index name, e.g. CC-MAIN-2020-50, where 50 is the week number.

The newly added indices for older crawls don't have a week number:

"id": "CC-MAIN-2012",
"id": "CC-MAIN-2009-2010",
"id": "CC-MAIN-2008-2009",

This leads to a bug: 2012 is ignored, and I think 2009 and 2008 are treated as if they had a week number of 20.
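To make the failure mode concrete, here is a minimal sketch of how a pattern that assumes CC-MAIN-YYYY-WW behaves on the three ids above, next to a stricter alternative. Both patterns and the names `naive_parse` and `parse_index_id` are hypothetical illustrations, not cdx_toolkit's actual code or its actual fix.

```python
import re

# Naive pattern that assumes every id looks like CC-MAIN-YYYY-WW.
# Illustration of the failure mode only, not cdx_toolkit's real code.
NAIVE = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")

# Hypothetical stricter pattern: a year, then optionally either a
# 2-digit week or a 4-digit end year, anchored at both ends.
STRICT = re.compile(r"^CC-MAIN-(\d{4})(?:-(\d{2})|-(\d{4}))?$")


def naive_parse(index_id):
    """Return (year, week), or None when the id has no '-NN' suffix."""
    m = NAIVE.match(index_id)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def parse_index_id(index_id):
    """Return (first_year, last_year, week); week is None for yearly ids."""
    m = STRICT.match(index_id)
    if m is None:
        raise ValueError(f"unrecognized index id: {index_id!r}")
    first_year = int(m.group(1))
    week = int(m.group(2)) if m.group(2) else None
    last_year = int(m.group(3)) if m.group(3) else first_year
    return first_year, last_year, week


for index_id in ("CC-MAIN-2020-50", "CC-MAIN-2012", "CC-MAIN-2009-2010"):
    print(index_id, naive_parse(index_id), parse_index_id(index_id))
```

With the naive pattern, CC-MAIN-2012 does not match at all, and CC-MAIN-2009-2010 matches with the first two digits of 2010 read as the week, which is consistent with the symptoms described here.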

yujianll commented 3 years ago

@wumpus Are you using some default week number for 2012 as well? I tried this and it returns nothing:

import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = "www.cnn.com/*"
objs = cdx.iter(url, from_ts='201207', to='201304', filter=['status:200', 'mime:text/html'])

However, objs = cdx.iter(url, from_ts='201201', to='201204', filter=['status:200', 'mime:text/html']) does return some results. Does this command search the whole "CC-MAIN-2012" index? That is, would any part of the index go unsearched if I ran this command once for every two-month window (e.g. 201201-201202, 201203-201204)?
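The two-month windows described above could be generated with a small helper like the one below. This is only a sketch: `month_windows` is a hypothetical name, and it assumes the `to` value covers the whole of the month it names, which should be checked against how cdx_toolkit pads partial timestamps.

```python
def month_windows(start, end, months=2):
    """Yield (from_ts, to) pairs of 'YYYYMM' strings covering start..end.

    Hypothetical helper; both endpoints are inclusive, and the last
    window is shortened if the range does not divide evenly.
    """
    # Work with a single running month index so year rollover is simple.
    index = int(start[:4]) * 12 + int(start[4:6]) - 1
    end_index = int(end[:4]) * 12 + int(end[4:6]) - 1
    while index <= end_index:
        last = min(index + months - 1, end_index)
        yield (
            f"{index // 12:04d}{index % 12 + 1:02d}",
            f"{last // 12:04d}{last % 12 + 1:02d}",
        )
        index = last + 1


for from_ts, to in month_windows("201201", "201206"):
    # Each pair could be passed as from_ts=... and to=... to cdx.iter().
    print(from_ts, to)
```

Because consecutive windows start on the month after the previous window ends, the windows neither overlap nor leave gaps at the month level.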

wumpus commented 3 years ago

I don't think there are any crawl results after July 2012.

BTW the bug I mentioned above was fixed a while ago.