issues
cocrawler/cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Apache License 2.0
157 stars
30 forks
feat: split CI into normal and slow | #38 | wumpus | closed | 2 days ago | 1
fix: remove unnecessary sleep | #37 | wumpus | closed | 2 days ago | 1
improve ci | #36 | wumpus | closed | 3 days ago | 1
Retries | #35 | wumpus | closed | 5 days ago | 0
revive CI | #34 | wumpus | closed | 4 days ago | 1
crasher | #33 | wumpus | opened | 1 week ago | 0
Bump setuptools from 57.0.0 to 70.0.0 | #32 | dependabot[bot] | closed | 6 days ago | 1
Bump requests from 2.25.1 to 2.32.0 | #31 | dependabot[bot] | closed | 6 days ago | 1
Raise an error on 429 (rate limiting) | #30 | davidlenz | closed | 6 months ago | 1
... | #29 | aweitz | closed | 11 months ago | 1
Fixes pip install with python 3.10 and 3.11 | #28 | mgrbyte | closed | 6 months ago | 5
Bump setuptools from 57.0.0 to 65.5.1 | #27 | dependabot[bot] | closed | 6 months ago | 1
CommonCrawl index date range code is broken | #26 | wumpus | opened | 2 years ago | 5
myrequests.py gets stuck in a loop if the response status is always 429, 500, 502, 503, 504, 509 | #25 | Medstaar | closed | 2 years ago | 3
`collinfo.json` URL returning empty list | #24 | nfmcclure | closed | 2 years ago | 2
[do not merge] add support for the columnar index | #23 | wumpus | opened | 2 years ago | 0
Retrieving objects for a set or list of URL's in parallel | #22 | vikas95 | opened | 2 years ago | 3
python 3.10 testing | #21 | wumpus | closed | 2 years ago | 0
Results not complete for Common Crawl index 2012 | #20 | yujianll | closed | 3 years ago | 1
404 seen for API call | #19 | yujianll | closed | 3 years ago | 1
Fix common crawl date ranges | #18 | wumpus | closed | 3 years ago | 0
new Common Crawl year indices | #17 | wumpus | closed | 3 years ago | 2
next(CDXFetcherIter) hangs when cc request returns 400/404 | #16 | windymay | closed | 3 years ago | 2
Update README.md | #15 | yeus | opened | 3 years ago | 1
Azure ci | #14 | wumpus | closed | 3 years ago | 0
"ValueError('invalid hostname in url '+url) from None" when accessing internet archive CaptureObject.content | #13 | codekoriko | closed | 3 years ago | 2
make fetching capture contents a visible API | #12 | wumpus | closed | 3 years ago | 1
[Question] | #11 | ghost | closed | 3 years ago | 3
here is some usage examples | #10 | sloev | closed | 3 years ago | 2
Filters and url of crawled page in python client | #9 | ydennisy | closed | 4 years ago | 4
Read timed out | #8 | Kavan72 | closed | 4 years ago | 5
Is it possible to get only one URL of one domain from a TLD? | #7 | bch80 | closed | 4 years ago | 4
WET-files | #6 | Cenzor | closed | 4 years ago | 1
Typo fix in README.md | #5 | lgov | closed | 4 years ago | 3
Command-line tools are shipped with non-portable Python runtime | #4 | sebastian-nagel | closed | 5 years ago | 3
Fix spelling of Ilya's surname in README. | #3 | machawk1 | closed | 6 years ago | 2
fetch_wb_content "js_" flag replaced w/ "id_" | #2 | marshallduval | closed | 6 years ago | 6
changed to alternate wayback cdx server | #1 | marshallduval | closed | 6 years ago | 1