commoncrawl
/
cc-index-table
Index Common Crawl archives in tabular format
Apache License 2.0
107
stars
9
forks
issues
Integrate end-of-term archive table conversion tool
#34
sebastian-nagel
closed
2 weeks ago
0
Add GitHub workflow / build and test automation
#33
sebastian-nagel
closed
1 month ago
0
News and WAT/WET compatibilities
#32
jt55401
opened
6 months ago
2
WARC files unreadable
#31
AmrSheta22
closed
1 year ago
3
Add IP column to Athena table for reverse IP search with `WARC-IP-Address` data
#30
cirosantilli
opened
1 year ago
0
Bump guava from 31.1-jre to 32.0.0-jre
#29
dependabot[bot]
opened
1 year ago
0
Bump spark-core_2.12 from 3.3.2 to 3.4.0
#28
dependabot[bot]
opened
1 year ago
0
Downloading the relevant jar file?
#27
bbrancar
closed
1 year ago
3
Improve extraction of host names and registered domains
#26
sebastian-nagel
opened
1 year ago
0
Consider normalizing host, domain names and TLDs
#25
sebastian-nagel
closed
1 year ago
1
How to use AWS Athena to query CC-NEWS data?
#24
vansenic
closed
1 month ago
1
Verify example queries using Athena engine v3
#23
sebastian-nagel
opened
1 year ago
0
Bump gson from 2.2.4 to 2.8.9
#22
dependabot[bot]
closed
2 years ago
0
spark-submit stopped suddenly
#21
aliebrahiiimi
closed
2 years ago
4
Allow to use a custom table schema
#20
sebastian-nagel
closed
2 years ago
0
CCIndexWarcExport: replace jets3t by AWS SDK (#3), access s3://commoncrawl/ with authentication
#19
sebastian-nagel
closed
2 years ago
0
Add AWS authentication for downloading data
#18
aliebrahiiimi
closed
2 years ago
6
Allow to use a custom table schema
#17
sebastian-nagel
closed
2 years ago
0
Parsing host names fails on trailing dot
#16
sebastian-nagel
closed
2 years ago
0
Support for DNS URIs and other non-HTTP URI/URL schemes
#15
sebastian-nagel
closed
2 years ago
0
Column with host name reverse domain name notation
#14
sebastian-nagel
closed
2 years ago
1
Replace int96 timestamps in index partitions before CC-MAIN-2020
#13
sebastian-nagel
opened
2 years ago
0
Investigate reasons why table isn't fully sorted by `url_surtkey`
#12
sebastian-nagel
closed
2 years ago
1
Explore Zstandard compression
#11
sebastian-nagel
opened
2 years ago
1
Upgrade to Spark 3.2.0
#10
sebastian-nagel
closed
2 years ago
0
Handle dns: lines in the CDXJ files.
#9
vphill
closed
2 years ago
5
Removing System.exit() calls as they interfere with Spark execution
#8
athulj
closed
4 years ago
3
Store column "fetch_time" as int64
#7
sebastian-nagel
closed
4 years ago
2
Add columns for redirect targets and WARC truncation
#6
sebastian-nagel
closed
5 years ago
0
CCIndexWarcExport - Equivalent in PySpark
#5
lukaskawerau
closed
5 years ago
3
Corrupted ".warc.gz" files being produced
#4
brad-safetonet
closed
5 years ago
3
CCIndexWarcExport: replace jets3t with AWS SDK
#3
sebastian-nagel
closed
2 years ago
6
Problem running the example for getting data for a language in Spark
#2
brad-safetonet
closed
5 years ago
8
Can I get the index table data from https:// rather than s3://?
#1
imfht
closed
6 years ago
2