bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Spaces after inline tags? #7

Open zuny26 opened 3 years ago

zuny26 commented 3 years ago

In some documents, adjacent text segments wrapped in inline elements (such as <span>) have no whitespace between them in the HTML and are separated visually only by CSS. In these cases, not inserting a space after inline tags produces glued text in the output, even though omitting the space is technically correct.

However, inline elements (such as <span>, <b> or <i>) can also be used to apply several formats within a single word, and inserting a space after them would split such words.
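
To make the trade-off concrete, here is a minimal sketch in Python using only the standard library. It is not how warc2text works; the names `INLINE_TAGS`, `TextExtractor`, `extract_text` and the `space_after_inline` switch are made up for illustration, and the sample HTML snippets are invented.

```python
from html.parser import HTMLParser

# Hypothetical set of inline tags, chosen only for this illustration.
INLINE_TAGS = {"span", "b", "i"}


class TextExtractor(HTMLParser):
    """Collects text nodes, optionally adding a space after inline end tags."""

    def __init__(self, space_after_inline):
        super().__init__()
        self.space_after_inline = space_after_inline
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # The behaviour under discussion: emit a space when an inline tag closes.
        if self.space_after_inline and tag in INLINE_TAGS:
            self.parts.append(" ")


def extract_text(html, space_after_inline):
    parser = TextExtractor(space_after_inline)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(parser.parts).split())


glued = "<span>Price:</span><span>10</span>"  # separated only by CSS
styled = "<b>H</b>ello"                       # two formats inside one word

print(extract_text(glued, False))   # Price:10
print(extract_text(glued, True))    # Price: 10
print(extract_text(styled, False))  # Hello
print(extract_text(styled, True))   # H ello
```

Neither setting handles both inputs: with the switch off the first snippet is glued, and with it on the second is split.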