DavidNemeskey/cc_corpus
Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0
12 stars
1 fork
issues
S3 download of index
#65
acheronw
opened
4 months ago
0
Dockerification
#64
acheronw
closed
4 months ago
0
Train / test sampling
#63
DavidNemeskey
closed
4 months ago
0
Pipeline refinements
#62
acheronw
closed
4 months ago
0
Webapp improvements
#61
acheronw
closed
5 months ago
0
Changed download file logging from info to debug
#60
acheronw
closed
7 months ago
0
Lsh bug fix
#59
acheronw
closed
7 months ago
0
Aws download
#58
acheronw
closed
7 months ago
1
Dockerification
#57
acheronw
closed
8 months ago
0
Manager webapp
#56
acheronw
closed
9 months ago
1
No cookie
#55
DavidNemeskey
closed
1 year ago
1
Birosagi hatarozatok
#54
acheronw
closed
4 months ago
0
Document json attribute consistency
#53
acheronw
closed
1 year ago
0
Token level filtering
#52
acheronw
closed
1 year ago
0
warc3-wet -> warc-knot.
#51
DavidNemeskey
closed
1 year ago
0
Bib support
#50
DavidNemeskey
closed
1 year ago
0
Json output
#49
acheronw
closed
1 year ago
0
Resulting (downloaded & filtered/processed) dataset available?
#48
arpitest
closed
1 year ago
1
New indexing
#47
acheronw
closed
1 year ago
1
Download specific URLs
#46
botondbarta
closed
1 year ago
5
Deduplicate webaratas
#45
acheronw
opened
1 year ago
0
New indexing
#44
DavidNemeskey
closed
1 year ago
0
*.hu unrecognized
#43
bbrancar
closed
1 year ago
2
Deduplication-related bugfixes + enhancements
#42
DavidNemeskey
closed
1 year ago
0
Json output
#41
acheronw
closed
1 year ago
0
Content type stats fixes
#40
DavidNemeskey
closed
1 year ago
0
Added tqdm progress bar to content_type_stats script
#39
acheronw
closed
1 year ago
0
Content type stats
#38
acheronw
closed
1 year ago
0
Parallel cumulative deduplication
#37
acheronw
closed
1 year ago
0
Parallel cumulative deduplication
#36
DavidNemeskey
closed
1 year ago
1
Json output
#35
acheronw
closed
1 year ago
0
Deduplication improvements
#34
acheronw
closed
1 year ago
1
RSS handling
#33
DavidNemeskey
closed
1 year ago
0
crawling process
#32
holuaat
closed
1 year ago
1
Fix multipart downloading
#31
DavidNemeskey
closed
1 year ago
0
Slow loading of indexer URLs
#30
DavidNemeskey
opened
1 year ago
0
Batch indexer
#29
DavidNemeskey
closed
1 year ago
1
Write a script that incrementally deduplicates indices
#28
DavidNemeskey
closed
1 year ago
0
New downloader
#27
DavidNemeskey
closed
1 year ago
0
Add language filter to filter_warc.py
#26
DavidNemeskey
opened
1 year ago
0
Trafilatura
#25
DavidNemeskey
closed
1 year ago
0
Use Trafilatura for boilerplate removal
#24
DavidNemeskey
opened
2 years ago
0
Faster download
#23
DavidNemeskey
closed
1 year ago
1
AWS authentication needed
#22
botondbarta
closed
1 year ago
2
Installation
#21
nofreewill42
opened
2 years ago
1
Conversion to fastText format
#20
DavidNemeskey
opened
3 years ago
0
Empty elements (sentence, paragraph, document) should be omitted from the corpus
#19
dlazesz
closed
3 years ago
3
Strange replacement characters in the text
#18
dlazesz
closed
3 years ago
3
Added tqdm to fix_corpus.py and moved otqdm to utils.
#17
DavidNemeskey
closed
3 years ago
0
Fixes
#16
DavidNemeskey
closed
3 years ago
0