dkpro/dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0
50 stars, 8 forks
issues
#46 Update path to CommonCrawl in documentation (habernal, closed, 8 years ago, 0 comments)
#45 inconsistent package hierarchy and groupId (maxxkia, opened, 8 years ago, 1 comment)
#44 passing directory as argument for boilerplate remover (maxxkia, opened, 8 years ago, 0 comments)
#43 Extract WARC records given a list of URLs (habernal, closed, 8 years ago, 4 comments)
#42 Avoid deploying doc module to repository (habernal, closed, 8 years ago, 0 comments)
#41 Phase4 Deduplication broken? (tfmorris, closed, 8 years ago, 2 comments)
#40 Upgrade to DKPro Parent POM 14 (reckart, closed, 8 years ago, 0 comments)
#39 Avoid deploying shaded JAR for hadoop module to repo/Maven central (reckart, closed, 8 years ago, 0 comments)
#38 Limit charset detection to first 8k bytes (tfmorris, closed, 8 years ago, 0 comments)
#37 Make Java JusText implementation match Python and/or document differences (tfmorris, opened, 8 years ago, 4 comments)
#36 Boilerplate removal header post processing incorrect (tfmorris, opened, 8 years ago, 0 comments)
#35 Upgrade to DKPro Parent POM 13 (reckart, closed, 8 years ago, 0 comments)
#34 Apache Commons projects are versioned separately... (reckart, closed, 8 years ago, 0 comments)
#33 POM in master branch contains non-SNAPSHOT version (reckart, closed, 8 years ago, 0 comments)
#32 Clarify license for Java JusTex implementation (tfmorris, closed, 8 years ago, 9 comments)
#31 Text normalization too aggressive? (tfmorris, opened, 8 years ago, 1 comment)
#30 HTML entities not decoded (tfmorris, opened, 8 years ago, 3 comments)
#29 Character encoding issues in boilerplate processing (tfmorris, opened, 8 years ago, 2 comments)
#28 Fix O(n!) in tag depth issue (tfmorris, opened, 8 years ago, 3 comments)
#27 O(n!) processing in tag name/path for Paragraph in dedupe code (tfmorris, opened, 8 years ago, 2 comments)
#26 Add use-case example: search for patterns in C4Corpus (habernal, closed, 8 years ago, 1 comment)
#25 Update Hadoop to 2.7.1 to keep up with latest AWS EMR version (habernal, opened, 8 years ago, 0 comments)
#24 Fix javadoc for Java 1.8 (habernal, closed, 8 years ago, 1 comment)
#23 Questions on statistics (tfmorris, opened, 8 years ago, 5 comments)
#22 Fix typo in README.md (rmtheis, closed, 8 years ago, 1 comment)
#21 SimHash returning 32-bit results, not 64-bits (tfmorris, opened, 8 years ago, 1 comment)
#20 Fix simhash slicing and add tests. Fixes #19. (tfmorris, opened, 8 years ago, 3 comments)
#19 SimHash slicing algorithm incorrect & inefficient (tfmorris, opened, 8 years ago, 0 comments)
#18 Consistent naming of output folders to match input CommonCrawl (habernal, closed, 8 years ago, 7 comments)
#17 Regularize crawl listing to match input for corpus (tfmorris, closed, 8 years ago, 1 comment)
#16 Replacement GoldStandard 103.txt provided by Miloš Jakubíček - fixes #9 (tfmorris, closed, 8 years ago, 1 comment)
#15 Store metadata about keeping minimal html in boilerplate removal (habernal, closed, 8 years ago, 0 comments)
#14 NullWritable as mapper's output key in Phase1 may slow things down (habernal, closed, 8 years ago, 1 comment)
#13 WARCFileWriter throws IOException if file already exists (habernal, closed, 8 years ago, 3 comments)
#12 Upgrade documentation to ascii-doc (habernal, closed, 8 years ago, 3 comments)
#11 Refactoring: Move WARC record outside hadoop module (habernal, closed, 8 years ago, 0 comments)
#10 Add example of reading processed data (habernal, closed, 8 years ago, 0 comments)
#9 Wrong contents in gold standard (tfmorris, closed, 8 years ago, 7 comments)
#8 Phase 1 loses English documents with License=none (habernal, closed, 8 years ago, 0 comments)
#7 Wrong package name in tests in dkpro-c4corpus-hadoop (habernal, closed, 8 years ago, 0 comments)
#6 Delete deprecated classes in de-duplication module (habernal, closed, 8 years ago, 0 comments)
#5 Switch to short folder names and module IDs (reckart, closed, 8 years ago, 3 comments)
#4 Improve CleanEval reproducibility documentation (habernal, closed, 8 years ago, 3 comments)
#3 Deploy release to Maven Central (habernal, closed, 8 years ago, 4 comments)
#2 Improve documentation for the first release (habernal, closed, 8 years ago, 1 comment)
#1 Add citation to the LREC article (habernal, closed, 8 years ago, 0 comments)