issues
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
MIT License
405 stars
86 forks
Add CCFileProcessorSparkJob to support file-wise processing
#45
jt55401
opened
3 months ago
4
Update documentation to emphasize that querying the columnar index requires S3 access
#44
sebastian-nagel
closed
7 months ago
0
Example using Resiliparse's HTML parser or text extractor
#43
sebastian-nagel
opened
1 year ago
0
Drop support for Python 2.7, fixes #40
#42
sebastian-nagel
closed
1 year ago
0
Use simdjson to read WAT payloads
#41
sebastian-nagel
opened
1 year ago
0
Drop support for Python 2.7
#40
sebastian-nagel
closed
1 year ago
1
Looks like ccspark tried to access everything from local file. What's wrong with the settings?
#39
GenuineReader
closed
1 year ago
1
Provide classes to use FastWARC to read WARC/WAT/WET files
#38
sebastian-nagel
closed
1 year ago
0
Provide classes to use FastWARC to read WARC/WAT/WET files
#37
sebastian-nagel
closed
1 year ago
0
Host link extraction does not represent every IDN as IDNA (fixes #35)
#36
sebastian-nagel
closed
2 years ago
0
Host link extraction does not represent every IDN as IDNA
#35
sebastian-nagel
closed
2 years ago
0
Incompatible Architecture
#34
swetepete
closed
1 year ago
4
Bad Substitution
#33
swetepete
closed
2 years ago
3
boto3 credentials error when running CCSparkJob with ~100 S3 warc paths as input, but works with <10 S3 warc paths as input
#32
praveenr019
closed
1 year ago
5
Variable data_type is incorrectly used
#31
praveenr019
closed
2 years ago
3
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
#30
BrownXing
closed
2 years ago
2
download only specific language data from wet files like Warc
#29
aliebrahiiimi
closed
1 year ago
1
Can not run server_count example on Windows locally
#28
brand17
closed
3 years ago
6
How To: process CC NEWS warc files, most recent first
#27
lgov
closed
2 years ago
3
Add support to read WARC/WAT/WET files from HDFS
#26
sebastian-nagel
closed
3 years ago
1
Use SparkSession instead of SQLContext, implements #24
#25
sebastian-nagel
closed
2 years ago
0
Use SparkSession instead of SQLContext
#24
sebastian-nagel
closed
2 years ago
5
Broken links in README
#23
gamtiq
closed
3 years ago
1
Use index field "content_charset" to speed up HTML parsing of WARC payload
#22
sebastian-nagel
closed
3 years ago
0
add arc file capability
#21
Xue-Alex
closed
3 years ago
3
Test and update examples to work with ARC files of the 2008 - 2012 crawls
#20
sebastian-nagel
opened
4 years ago
0
CCIndexSparkJob: allow to set schema of table "ccindex" via command-line
#19
sebastian-nagel
closed
3 years ago
0
Common Crawl Index Table - Need for Schema Merging to be documented
#18
chk2817
closed
3 years ago
2
Processing English only archives
#17
jaehunro
closed
4 years ago
2
Fix wrong accumulator name
#16
jaehunro
closed
4 years ago
1
Webgraph construction: include nodes with zero outgoing links
#15
sebastian-nagel
closed
4 years ago
0
CCIndexWarcSparkJob requires one of --query or --csv
#14
sebastian-nagel
closed
4 years ago
0
Document dependency of CCIndexSparkJob to Java S3 file system libs
#13
sebastian-nagel
closed
4 years ago
0
Commands to execute python files?
#12
calee88
closed
4 years ago
7
Avoid to set Spark configuration properties in code
#11
sebastian-nagel
closed
5 years ago
0
Add options to control spark resource allocation.
#10
tylerkovacs
closed
5 years ago
2
Mine metrics about truncated content in WARC files
#9
sebastian-nagel
closed
2 years ago
1
Allow to access WARC filename, record offset and length
#8
sebastian-nagel
closed
5 years ago
0
Drop support for Python2
#7
sebastian-nagel
closed
1 year ago
1
Allow to access WARC record filename and offset
#6
sebastian-nagel
closed
5 years ago
2
HDFS Patch
#5
cronoik
closed
3 years ago
1
Rob
#4
rdowd003
closed
2 years ago
0
Work-around to enable support of WARC/1.1 in warcio
#3
sebastian-nagel
closed
2 years ago
0
Close streams when they are from files
#2
lauraedelson
closed
6 years ago
4
Add arg[option] '--no-check-certificate' to wget
#1
layedra
closed
6 years ago
1