bitextor/warc2text
Extracts plain text, language identification and more metadata from WARC records
MIT License
Multiple improvements and bug fixes #6
Closed
zuny26 closed 3 years ago
zuny26 commented 3 years ago
Text extraction changes:
run charset detection on the HTML document (before extracting the clean text)
during HTML parsing, insert new lines when block tags are found
when parsing `script` and `style` elements, ignore the content until the end tag is found, to prevent parsing errors
handle consecutive blanks during HTML parsing
entity decoding:
  unescape entities after text extraction and conversion to UTF-8, rather than during parsing
  use a comprehensive list of named entities
avoid converting `std::string` to `char*` and vice versa
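
As an illustration of decoding entities on the already-extracted UTF-8 text rather than inside the HTML parser, here is a minimal sketch. It is not the code from this pull request: the function name `unescape_entities` and the table `kEntities` are hypothetical, the table holds only a handful of entries where the PR describes a comprehensive list, and numeric references such as `&#169;` are not handled.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical, deliberately tiny entity table (assumption for illustration only).
// Values are UTF-8 byte sequences.
static const std::unordered_map<std::string_view, std::string_view> kEntities = {
    {"amp", "&"},  {"lt", "<"},   {"gt", ">"},
    {"quot", "\""}, {"apos", "'"}, {"nbsp", "\xC2\xA0"},
};

// Replace named entities such as "&amp;" in text that has already been
// extracted and converted to UTF-8. Unknown or unterminated entities are
// copied through unchanged.
std::string unescape_entities(std::string_view text) {
    std::string out;
    out.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '&') {
            std::size_t end = text.find(';', i + 1);
            // Only look a short distance ahead; the names in this table are short.
            if (end != std::string_view::npos && end - i <= 10) {
                auto it = kEntities.find(text.substr(i + 1, end - i - 1));
                if (it != kEntities.end()) {
                    out.append(it->second);
                    i = end + 1;
                    continue;
                }
            }
        }
        out.push_back(text[i]);
        ++i;
    }
    return out;
}
```

Working on `std::string` and `std::string_view` throughout also avoids round trips through `char*`, in line with the last item above.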
Other changes:
specify output files with the `-f` option: `text` and `url` are always written; `mime` and `html` are optional
filter out documents based on extension: created list of extensions that will be ignored
use base64 conversion from preprocess
use zlib directly for output writing, instead of boost wrapper
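
The extension-based filter mentioned above could look roughly like the following sketch. Everything named here is an assumption: the function `has_ignored_extension`, the set `kIgnoredExtensions`, and its sample contents are hypothetical, since the description does not list the extensions that are actually ignored. The sketch also assumes the URL has a path component after the host.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical sample list (assumption); the real list is not given in this description.
static const std::unordered_set<std::string> kIgnoredExtensions = {
    "jpg", "png", "gif", "css", "js", "pdf",
};

// Return true if the last path segment of `url` ends in an ignored extension.
// The comparison is case-insensitive; query strings and fragments are dropped first.
bool has_ignored_extension(std::string_view url) {
    url = url.substr(0, url.find_first_of("?#"));
    std::size_t slash = url.rfind('/');
    std::string_view name =
        (slash == std::string_view::npos) ? url : url.substr(slash + 1);
    std::size_t dot = name.rfind('.');
    if (dot == std::string_view::npos || dot + 1 == name.size()) {
        return false;  // no extension at all, or a trailing dot
    }
    std::string ext(name.substr(dot + 1));
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return kIgnoredExtensions.count(ext) > 0;
}
```

A record whose URL matches would be skipped before any charset detection or HTML parsing is attempted.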