DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl

GNU Lesser General Public License v3.0

12 stars, 1 fork

issues

S3 download of index

#65 acheronw opened 4 months ago
0
Dockerification

#64 acheronw closed 4 months ago
0
Train / test sampling

#63 DavidNemeskey closed 4 months ago
0
Pipeline refinements

#62 acheronw closed 4 months ago
0
Webapp improvements

#61 acheronw closed 5 months ago
0
Changed download file logging from info to debug

#60 acheronw closed 7 months ago
0
Lsh bug fix

#59 acheronw closed 7 months ago
0
Aws download

#58 acheronw closed 7 months ago
1
Dockerification

#57 acheronw closed 8 months ago
0
Manager webapp

#56 acheronw closed 9 months ago
1
No cookie

#55 DavidNemeskey closed 1 year ago
1
Birosagi hatarozatok

#54 acheronw closed 4 months ago
0
Document json attribute consistency

#53 acheronw closed 1 year ago
0
Token level filtering

#52 acheronw closed 1 year ago
0
warc3-wet -> warc-knot.

#51 DavidNemeskey closed 1 year ago
0
Bib support

#50 DavidNemeskey closed 1 year ago
0
Json output

#49 acheronw closed 1 year ago
0
Resulting (downloaded & filtered/processed) dataset available?

#48 arpitest closed 1 year ago
1
New indexing

#47 acheronw closed 1 year ago
1
Download specific URLs

#46 botondbarta closed 1 year ago
5
Deduplicate webaratas

#45 acheronw opened 1 year ago
0
New indexing

#44 DavidNemeskey closed 1 year ago
0
*.hu unrecognized

#43 bbrancar closed 1 year ago
2
Deduplication-related bugfixes + enhancements

#42 DavidNemeskey closed 1 year ago
0
Json output

#41 acheronw closed 1 year ago
0
Content type stats fixes

#40 DavidNemeskey closed 1 year ago
0
Added tqdm progress bar to content_type_stats script

#39 acheronw closed 1 year ago
0
Content type stats

#38 acheronw closed 1 year ago
0
Parallel cumulative deduplication

#37 acheronw closed 1 year ago
0
Parallel cumulative deduplication

#36 DavidNemeskey closed 1 year ago
1
Json output

#35 acheronw closed 1 year ago
0
Deduplication improvements

#34 acheronw closed 1 year ago
1
RSS handling

#33 DavidNemeskey closed 1 year ago
0
crawling process

#32 holuaat closed 1 year ago
1
Fix multipart downloading

#31 DavidNemeskey closed 1 year ago
0
Slow loading of indexer URLs

#30 DavidNemeskey opened 1 year ago
0
Batch indexer

#29 DavidNemeskey closed 1 year ago
1
Write a script that incrementally deduplicates indices

#28 DavidNemeskey closed 1 year ago
0
New downloader

#27 DavidNemeskey closed 1 year ago
0
Add language filter to filter_warc.py

#26 DavidNemeskey opened 1 year ago
0
Trafilatura

#25 DavidNemeskey closed 1 year ago
0
Use Trafilatura for boilerplate removal

#24 DavidNemeskey opened 2 years ago
0
Faster download

#23 DavidNemeskey closed 1 year ago
1
AWS authentication needed

#22 botondbarta closed 1 year ago
2
Installation

#21 nofreewill42 opened 2 years ago
1
Conversion to fastText format

#20 DavidNemeskey opened 3 years ago
0
Empty elements (sentence, paragraph, document) should be omitted from the corpus

#19 dlazesz closed 3 years ago
3
Strange replacement characters in the text

#18 dlazesz closed 3 years ago
3
Added tqdm to fix_corpus.py and moved otqdm to utils.

#17 DavidNemeskey closed 3 years ago
0
Fixes

#16 DavidNemeskey closed 3 years ago
0