bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Paragraph identification #33

Closed cgr71ii closed 2 years ago

cgr71ii commented 2 years ago

New feature that identifies paragraphs when the --paragraph-identification flag is set. It adds an identifier to each paragraph extracted from the HTML before the text is base64-encoded. When --paragraph-identification is set, keep in mind that the base64-decoded document is effectively encoded "twice": if the text is going to be processed, it has to be split and the first element taken.

The format is: <document content><tab><paragraph identifier>. The paragraph identifier is a number which starts at 0.
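
The decoding step described above can be sketched in Python as follows. This is a minimal illustration, not code from warc2text: it assumes that each line of the base64-decoded document holds one paragraph followed by a tab and its identifier, and the function name `decode_paragraphs` is hypothetical.

```python
import base64


def decode_paragraphs(encoded: str) -> list[tuple[str, int]]:
    """Decode a base64 document and split off the paragraph identifiers.

    Assumption (not confirmed by the PR): every non-empty line of the
    decoded text has the form "<content>\t<paragraph identifier>".
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    paragraphs = []
    for line in decoded.split("\n"):
        if not line:
            continue  # skip the empty string left by a trailing newline
        # Split on the last tab so the identifier is always the final field.
        content, _, identifier = line.rpartition("\t")
        paragraphs.append((content, int(identifier)))
    return paragraphs


# Hypothetical sample document, built here only to exercise the function.
sample = base64.b64encode(
    "First paragraph\t0\nSecond paragraph\t1\n".encode("utf-8")
).decode("ascii")
paragraphs = decode_paragraphs(sample)
```

Splitting on the last tab rather than the first is a defensive choice in this sketch, in case the content itself ever contains a tab. Code that only needs the text can keep the first element of each pair and ignore the identifier.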