Paragraph identification

New feature: when the --paragraph-identification flag is set, an identifier is added to each paragraph extracted from the HTML before the text is base64-encoded. With the flag set, the base64-decoded output is no longer plain text alone: the document has effectively been encoded "twice". Any code that processes the text therefore has to split the decoded content and take the first element to recover the text itself.

The format is: <document content><tab><paragraph identifier>. The paragraph identifier is a number which starts at 0.
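As a rough illustration of the decoding step described above, here is a minimal Python sketch. It is not taken from the project: the function name split_paragraphs is hypothetical, and it assumes that the decoded text is UTF-8 and that each paragraph sits on its own line with its identifier after the last tab.

```python
import base64


def split_paragraphs(encoded: str) -> list[tuple[str, int]]:
    """Decode a base64 document and separate text from paragraph identifiers.

    Assumption (not confirmed by the source): one paragraph per line,
    laid out as <text><tab><identifier>.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    pairs = []
    for line in decoded.split("\n"):
        if not line:
            continue  # skip the empty piece left by a trailing newline
        # Split on the last tab so that tabs inside the text are preserved.
        text, _, identifier = line.rpartition("\t")
        pairs.append((text, int(identifier)))
    return pairs


# Build a small sample in the assumed layout and decode it again.
raw = "First paragraph\t0\nSecond paragraph\t1\n"
sample = base64.b64encode(raw.encode("utf-8")).decode("ascii")
pairs = split_paragraphs(sample)
for text, identifier in pairs:
    print(identifier, text)
```

Splitting on the last tab rather than the first is a choice made for this sketch, so that a tab inside the paragraph text does not get mistaken for the separator.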

bitextor / warc2text

Paragraph identification #33