bitextor/warc2text
Extracts plain text, language identification and more metadata from WARC records
MIT License
20 stars
5 forks
issues
#10 Use regular expression for filters (zuny26, closed 3 years ago, 0 comments)
#9 Write PDF records to a separate WARC file (zuny26, closed 3 years ago, 0 comments)
#8 Normalization of different whitespace HTML entities (zuny26, opened 3 years ago, 0 comments)
#7 Spaces after inline tags? (zuny26, opened 3 years ago, 0 comments)
#6 Multiple improvements and bug fixes (zuny26, closed 3 years ago, 0 comments)
#5 Installation issue with uchardet (when installed in custom location) (mksifakis, closed 3 years ago, 12 comments)
#4 Filter out documents that contain tags that could indicate machine translation (zuny26, closed 3 years ago, 3 comments)
#3 Use uchardet to detect document encoding (zuny26, closed 3 years ago, 0 comments)
#2 HTML5 should implicitly close some tags (lpla, closed 3 years ago, 1 comment)
#1 Only separate paragraphs by block tags, not HTML text endlines (lpla, closed 3 years ago, 1 comment)