attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Various tags such as q, br, ins, del are not filtered out #300

Open adno opened 1 year ago

adno commented 1 year ago

Many elements/tags are left in wikiextractor's output, such as poem, q, ins, del, br, section, onlyinclude, includeonly and math. Mathematical equations (with commands such as \mathbf) also appear without being enclosed in any tags.

  1. Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
  2. Invoke the following command to list lines that contain the opening tags of these elements:

wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(poem\|q\|section\|ins\|del\|math\|onlyinclude\|br\|chem\)\b'

Examples from the output:

<poem>
<poem style="margin-left:2em">
<br>"domestic:" good automatic telephone system
…
Benzene, <chem>C6H6</chem>, …
…
<section end="Big Brother series" />
…
<onlyinclude>
…
<chem>O2{} + 4H+(aq){} + 4 Fe^{2+}(cyt\,c) -> 2H2O{} + 4 Fe^{3+}(cyt\,c) </chem> formula_1
…
</includeonly><section end=Lineups />

(Not all of the tags appear in this particular dump.)
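To tally which of these tags survive rather than just list the matching lines, the grep above can be approximated in Python. This is a minimal sketch and not part of wikiextractor: the function name `count_leftover_tags` and the `TAGS` tuple are hypothetical, and the tag list is simply copied from the grep pattern, so it is not exhaustive.

```python
import re
from collections import Counter

# Tag names copied from the grep pattern in this report (assumed, not exhaustive).
TAGS = ("poem", "q", "section", "ins", "del", "math", "onlyinclude", "br", "chem")

# Match an opening tag: "<", one of the names, then a word boundary,
# mirroring the grep expression '<\(poem\|q\|...\)\b'.
OPENING_TAG = re.compile(r"<(" + "|".join(TAGS) + r")\b")


def count_leftover_tags(lines):
    """Count opening tags per tag name across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # findall returns the captured tag name for every match on the line.
        counts.update(OPENING_TAG.findall(line))
    return counts
```

One way to use it would be to pipe the extractor output into a script that calls `count_leftover_tags(sys.stdin)` and prints the resulting counts. Closing tags such as `</chem>` are deliberately not counted, matching the behaviour of the grep.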

adno commented 1 year ago

There are similar issues with mapframe and score elements (#301) and table formatting (#298).