centic9/CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.
BSD 2-Clause "Simplified" License
63 stars, 20 forks
issues
#28 Downloaded PDF files are capped at 1 MB (adamklec, closed, 1 year ago, 2 comments)
#27 Gradle 6.0.1 (ghost, closed, 4 years ago, 0 comments)
#26 Gradle 6.0 (ghost, closed, 4 years ago, 0 comments)
#25 Gradle 5.6.4 (ghost, closed, 4 years ago, 0 comments)
#24 Gradle 5.6.3 (ghost, closed, 4 years ago, 0 comments)
#23 Gradle 5.6.2 (ghost, closed, 4 years ago, 0 comments)
#22 Gradle 5.6.1 (ghost, closed, 4 years ago, 0 comments)
#21 Gradle 5.6 (ghost, closed, 4 years ago, 0 comments)
#20 Gradle 5.5.1 (ghost, closed, 4 years ago, 0 comments)
#19 Gradle 5.5 (ghost, closed, 4 years ago, 0 comments)
#18 Gradle 5.4.1 (ghost, closed, 4 years ago, 0 comments)
#17 Gradle 5.4 (ghost, closed, 4 years ago, 0 comments)
#16 Gradle 5.3.1 (ghost, closed, 4 years ago, 0 comments)
#15 Gradle 5.3 (ghost, closed, 4 years ago, 0 comments)
#14 Gradle 5.2.1 (ghost, closed, 4 years ago, 0 comments)
#13 Maximum file size (ankitmodi10, closed, 6 years ago, 1 comment)
#12 Wet and Wat files (burf2000, closed, 6 years ago, 5 comments)
#11 Documentation needed (gleporeNARA, closed, 2 years ago, 1 comment)
#10 Unable to download (fizerkhan, closed, 7 years ago, 3 comments)
#9 Use exclusively Common Crawl's new Public Dataset bucket s3://commoncrawl/ (sebastian-nagel, closed, 7 years ago, 2 comments)
#8 Gradle 2.14 (ghost, closed, 8 years ago, 0 comments)
#7 Gradle 2.13 (ghost, closed, 8 years ago, 1 comment)
#6 Gradle 2.12 (ghost, closed, 8 years ago, 0 comments)
#5 Gradle 2.11 (ghost, closed, 8 years ago, 0 comments)
#4 Gradle 2.10 (ghost, closed, 8 years ago, 0 comments)
#3 Gradle 2.9 (ghost, closed, 8 years ago, 0 comments)
#2 Gradle 2.8 (ghost, closed, 9 years ago, 1 comment)
#1 Gradle 2.7 (ghost, closed, 9 years ago, 1 comment)