cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Apache License 2.0

CommonCrawl index date range code is broken #26

Open wumpus opened 2 years ago

wumpus commented 2 years ago
cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1" 200 1157
INFO:cdx_toolkit.commoncrawl:Found 87 endpoints in the Common Crawl index
INFO:cdx_toolkit:making a custom cc index list
INFO:cdx_toolkit.commoncrawl:using cc index range from https://index.commoncrawl.org/CC-MAIN-2021-04-index to https://index.commoncrawl.org/CC-MAIN-2020-50-index
INFO:cdx_toolkit:get_more: fetching cdx from https://index.commoncrawl.org/CC-MAIN-2021-04-index

The above date range should be empty, because --from 2021 is later than --to 2020.
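A minimal sketch of the kind of check that would catch an inverted range like this one. It is not cdx_toolkit's actual code: `pad_timestamp`, `is_empty_range`, and the two padding templates are hypothetical names, and the sketch assumes partial timestamps such as `2021` are expanded to 14 digits before comparison.

```python
# Hypothetical helpers, not part of cdx_toolkit.
LOWER = "00000101000000"  # pads a start timestamp to the earliest moment
UPPER = "99991231235959"  # pads an end timestamp to the latest moment


def pad_timestamp(ts: str, template: str) -> str:
    """Extend a partial timestamp (e.g. '2021') to 14 digits."""
    # Day 31 is used for every month, which is not always a real date,
    # but the padded strings are only compared, never parsed.
    return ts + template[len(ts):]


def is_empty_range(from_ts: str, to: str) -> bool:
    """True when the start of the range falls after its end."""
    return pad_timestamp(from_ts, LOWER) > pad_timestamp(to, UPPER)


print(is_empty_range("2021", "2020"))  # True
print(is_empty_range("2020", "2021"))  # False
```

With a guard like this, the command above could report an empty result instead of selecting an index range that runs backwards.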

Medstaar commented 1 year ago

I've recently started using ranges and hit this issue. Is this likely to be picked up in the near future? I've also noticed that the 'closest' argument works okay for commoncrawl and creates a 3 month window, but it does not work for wayback.

wumpus commented 1 year ago

Can you give some examples? The bug I was complaining about shouldn't affect any real usage.

Medstaar commented 1 year ago

Sorry, I think I might have misunderstood how the ranges work. It looks like if I put from=20220101 it will use the index CC-MAIN-2021-49 (November 2021), and if I put from=20220401 it will use CC-MAIN-2022-05 (January 2022). So it looks like it actually starts from the latest index that is dated at or before the date provided.
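The behaviour described above can be sketched roughly as follows. This is an illustration of the observed selection rule, not cdx_toolkit's implementation: `index_start` and `first_index_for` are hypothetical names, the index list is a small hand-picked sample, and dating each index by the Monday of the ISO week in its name is an assumption.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional

# A small sample of Common Crawl index names, for illustration only.
INDEXES = [
    "CC-MAIN-2021-43",
    "CC-MAIN-2021-49",
    "CC-MAIN-2022-05",
    "CC-MAIN-2022-21",
]


def index_start(name: str) -> date:
    """Assumed date of an index: Monday of the ISO week in its name."""
    year, week = name.split("-")[-2:]
    return date.fromisocalendar(int(year), int(week), 1)


def first_index_for(from_date: date, indexes: List[str] = INDEXES) -> Optional[str]:
    """Return the latest index dated at or before from_date, if any."""
    ordered = sorted(indexes, key=index_start)
    starts = [index_start(name) for name in ordered]
    pos = bisect_right(starts, from_date)
    return ordered[pos - 1] if pos else None


print(first_index_for(date(2022, 1, 1)))  # CC-MAIN-2021-49
print(first_index_for(date(2022, 4, 1)))  # CC-MAIN-2022-05
```

Under these assumptions both examples reproduce the indexes reported above: each from date maps to the most recent index that starts on or before it.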

For wayback, if I use closest=20221007 it seems to extract URLs with a 2019 timestamp. Using from and to works fine with wayback, however.

wumpus commented 1 year ago

OK, so Common Crawl is doing the right thing, and the closest on wayback issue is a problem on the Internet Archive side, something I can't control.

sgjohnson1981 commented 5 months ago

I don't know what precisely you're trying to explain, but my issue is also related to the index date ranges, though I'm trying to use them programmatically with from_ts. Using it with the iter method isn't working: it doesn't return anything. Calling iter without from_ts works, but I don't need every capture going back a year or whatever the default is.