Closed: wumpus closed this issue 3 years ago
@wumpus Are you using some default week number for 2012 as well? I tried this and it returns nothing:
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = "www.cnn.com/*"
objs = cdx.iter(url, from_ts='201207', to='201304', filter=['status:200', 'mime:text/html'])
However, this call does return some results:
objs = cdx.iter(url, from_ts='201201', to='201204', filter=['status:200', 'mime:text/html'])
Does this command search the whole "CC-MAIN-2012" index? That is, would any part of the index be missed if I ran it in two-month windows (e.g. 201201-201202, 201203-201204)?
I don't think there are any crawl results after July 2012.
BTW the bug I mention above was solved a while ago.
The cdx_toolkit code gets Common Crawl index dates by parsing the index name, e.g. CC-MAIN-2020-50.
The newly added indices for the old crawls don't have a week number, which leads to the bug that 2012 is ignored, and I think 2009 and 2008 are treated as if they have a week number of 20.
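To make the failure mode concrete, here is a minimal sketch of how a year-plus-week pattern can misread these names. It is not the actual cdx_toolkit code: `naive_parse` and `careful_parse` are hypothetical helpers, and the name CC-MAIN-2009-2010 is assumed to be one of the old year-range indices (only CC-MAIN-2020-50 and CC-MAIN-2012 appear in this thread).

```python
import re

# Hypothetical pattern that expects "CC-MAIN-<year>-<week>".
NAIVE = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")

def naive_parse(name):
    """Return (year, week), or None if the name has no '-NN' part."""
    m = NAIVE.match(name)
    if m is None:
        return None  # e.g. CC-MAIN-2012 is silently skipped
    return int(m.group(1)), int(m.group(2))

def careful_parse(name):
    """Return (year, week), with week=None when the name carries no week number."""
    m = re.fullmatch(r"CC-MAIN-(\d{4})(?:-(\d{4}|\d{2}))?", name)
    if m is None:
        return None
    year, suffix = int(m.group(1)), m.group(2)
    if suffix is None or len(suffix) == 4:
        # No suffix, or a four-digit suffix (assumed to be an end year, not a week).
        return year, None
    return year, int(suffix)

for name in ["CC-MAIN-2020-50", "CC-MAIN-2012", "CC-MAIN-2009-2010"]:
    print(name, naive_parse(name), careful_parse(name))
```

With the naive pattern, the 2012 name does not match at all, and the first two digits of the assumed end year are read as week 20, which is consistent with the symptoms described above.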