commoncrawl/nutch
Common Crawl fork of Apache Nutch
Apache License 2.0
27 stars
2 forks
issues
Add Github workflow to build the branch 'cc'
#31
sebastian-nagel
opened
1 week ago
0
WARC writer support HTTP/2
#30
sebastian-nagel
closed
3 months ago
0
WARC writer support HTTP/2
#29
sebastian-nagel
closed
3 months ago
1
Generator2: improvements and fixes
#28
sebastian-nagel
closed
6 months ago
0
Add param to fast.urlfilter to filter based on length of the URL
#27
jnioche
closed
11 months ago
5
Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
#26
jnioche
closed
6 months ago
7
What's the difference between Apache Nutch and the commoncrawl fork?
#25
adulau
closed
1 year ago
2
Upgrade webarchive-commons dependency to include fix of SURT maker / URL canonicalizer
#24
sebastian-nagel
opened
1 year ago
0
Evaluate performance of Amazon Corretto Crypto Provider
#23
tfmorris
opened
1 year ago
0
Evaluate zlib-cloudflare for 15% performance speedup of WarcRecordWriter
#22
tfmorris
opened
1 year ago
4
WARC writer: unit tests for conversion of URLs to URIs
#21
sebastian-nagel
opened
1 year ago
0
WARC writer: use URI.toASCIIString() instead of URI.toString()
#20
sebastian-nagel
closed
5 months ago
1
Fetcher: filter and verify robots.txt responses before archiving
#19
sebastian-nagel
closed
3 years ago
1
WarcCdxWriter: extraction of redirect targets for CDX should not be case-sensitive
#18
sebastian-nagel
closed
4 years ago
2
Ensure loading of recent public suffix list (effective_tld_names.dat)
#17
sebastian-nagel
closed
4 years ago
1
Improvements in Hadoop's s3a output committers obsolete class S3FileOutputFormat
#16
sebastian-nagel
opened
4 years ago
1
WARC writer (CDX writer): new optional CDX JSON fields "redirect" and "truncated"
#15
sebastian-nagel
closed
5 years ago
0
WARC-Date in robots.txt subset not to rely on HTTP Date
#14
sebastian-nagel
closed
5 years ago
0
More detailed marking of truncated records due to "network disconnect"
#13
sebastian-nagel
opened
5 years ago
0
[parse-tika] class path issue if parsing recursively
#12
sebastian-nagel
closed
4 years ago
1
[WARC writer] end datetime in WARC file name must be fixed to timelimit
#11
sebastian-nagel
closed
5 years ago
0
[WARC writer / protocol-okhttp] WARC-Truncated header issues and improvements
#10
sebastian-nagel
closed
5 years ago
3
WarcRecordWriter to write and index WAT/WET files
#9
sebastian-nagel
opened
5 years ago
0
WarcRecordWriter performance improvements
#8
sebastian-nagel
closed
5 years ago
5
Speedup initialization of charset AutoDetectReader required for language detection
#7
sebastian-nagel
closed
5 years ago
12
WARC writer language detection: ensure proper charset detection
#6
sebastian-nagel
closed
6 years ago
2
WARC writer incorrectly adds extra line in response records between HTTP headers and payload content
#5
sebastian-nagel
closed
5 years ago
5
Store redirected robots.txt under redirect URL in WARC
#4
sebastian-nagel
closed
7 years ago
0
Add WARC field WARC-Identified-Payload-Type
#3
sebastian-nagel
closed
7 years ago
1
Use capture time for warcinfo WARC-Date and timestamp in WARC filename
#2
sebastian-nagel
closed
7 years ago
1
Redirects lost in DedupRedirectsJob
#1
sebastian-nagel
closed
8 years ago
1