Multiple improvements and bug fixes

Text extraction changes:
- run charset detection on the HTML document (before extracting the clean text)
- during HTML parsing, insert new lines when block tags are found
- when parsing script and style elements, ignore the content until the end tag is found, to prevent parsing errors
- handle consecutive blanks during HTML parsing
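
As an illustration of the blank handling mentioned above, here is a minimal sketch. It assumes that "blanks" means spaces and tabs and that each run should collapse to a single space; the helper name collapse_blanks is hypothetical and not taken from the warc2text sources.

```cpp
#include <string>

// Hypothetical helper (not from warc2text): collapses each run of
// spaces and tabs into a single space. Newlines are left untouched so
// that line breaks inserted for block tags survive.
std::string collapse_blanks(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    bool prev_blank = false;
    for (char c : in) {
        const bool blank = (c == ' ' || c == '\t');
        if (blank) {
            // Emit only the first blank of a run, normalised to a space.
            if (!prev_blank) out.push_back(' ');
        } else {
            out.push_back(c);
        }
        prev_blank = blank;
    }
    return out;
}
```

In this sketch the collapsing is done on an already extracted string; the actual change may well do it incrementally while parsing.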

Entity decoding:
- unescape entities after extracting the text and converting it to UTF-8, instead of during parsing
- use a comprehensive list of named entities
- avoid converting std::string to char* and vice versa
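
A rough sketch of what unescaping on the extracted UTF-8 text could look like, working on std::string throughout. The function name unescape_entities and the five-entry table are assumptions made for illustration; the real table is described above as comprehensive, and numeric references are not handled here.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical sketch (not the warc2text implementation): replaces a few
// named entities in already extracted UTF-8 text. Unknown or unterminated
// entities are copied through unchanged.
std::string unescape_entities(const std::string& text) {
    // Illustrative subset only; values are UTF-8 encoded.
    static const std::unordered_map<std::string, std::string> entities = {
        {"amp", "&"}, {"lt", "<"}, {"gt", ">"},
        {"quot", "\""}, {"nbsp", "\xC2\xA0"}
    };
    std::string out;
    out.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '&') {
            const std::size_t semi = text.find(';', i + 1);
            if (semi != std::string::npos) {
                // Candidate name is the text between '&' and ';'.
                const auto it = entities.find(text.substr(i + 1, semi - i - 1));
                if (it != entities.end()) {
                    out += it->second;
                    i = semi + 1;
                    continue;
                }
            }
        }
        out.push_back(text[i]);
        ++i;
    }
    return out;
}
```

Because the replacement runs after extraction, the parser never has to distinguish an escaped "&lt;" from a real tag opener.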

Other changes:
- specify output files with -f option: text and url are always written; mime and html are optional
- filter out documents based on their extension: add a list of extensions to ignore
- use base64 conversion from preprocess
- use zlib directly for writing output, instead of the Boost wrapper
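
The extension filter could be sketched roughly as follows. This assumes the extension is read from the URL path, case-insensitively, with any query string or fragment ignored; the function name has_ignored_extension and the extensions used in the usage note are hypothetical and do not reflect the actual list.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical sketch (not the warc2text implementation): returns true
// when the URL ends in an extension contained in `ignored`.
// `ignored` is expected to hold lowercase extensions without the dot.
bool has_ignored_extension(const std::string& url,
                           const std::unordered_set<std::string>& ignored) {
    // Drop the query string and fragment, if any.
    const std::string path = url.substr(0, url.find_first_of("?#"));
    const std::size_t slash = path.find_last_of('/');
    const std::size_t dot = path.find_last_of('.');
    if (dot == std::string::npos) return false;
    // A dot before the last slash belongs to a directory or host name.
    if (slash != std::string::npos && dot < slash) return false;
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return ignored.count(ext) > 0;
}
```

With an ignore set of {"jpg", "css", "js"}, a URL such as http://example.com/img/photo.JPG?x=1 would be filtered out, while http://example.com/page would be kept. A bare host URL like http://example.com yields "com" as its extension in this sketch, so a real filter would need to treat that case separately.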

bitextor / warc2text