Several elements with non-textual content such as maps and musical scores (elements mapframe and score) are not filtered out. Steps to reproduce:
Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
Invoke the following command to list lines that contain the opening tags of these elements:
wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(score\|mapframe\)\b'
Output:
<score sound="1"> % Adding least one space before each line is recommended
<mapframe latitude="37.7235" longitude="23.4819" zoom="10" width="200" height="131" align="left" />Aegina is roughly triangular in shape, approximately from east to west and from north to south, with an area of .
<score sound="1">{ \time 4/4 c'4 c' g' g' | a' a' g'2 | f'4 f' e' e' | d' d' c'2 | g'4 g' f' f' | e' e' d'2 | g'4 \times 2/3 { f'8 f' f' } e'4 d' | c' r r2 | \bar "|." } \addlyrics { A B C D E F G H I J K L M N O P Q R S T U V dub- a- U X Y "Z(ed)" }</score>
<score sound="1">
<score sound="1">
<mapframe width=400 height=200 zoom=6 latitude=42.977 longitude=-76.506 align=left text="The original path of the Erie Canal">
<mapframe latitude="37.807984" longitude="-122.475411" zoom="18" width="213" height="180" align="right">
In general, the output also contains the content delimited by the tags (musical scores and map data). In some cases, both of the opening/closing tags (or parth the score itself) for musical scores are missing, e.g. article id="152940" from dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles3.xml-p151574p311329.bz2 contains only the opening <score>:
Sheet music does not often explicitly indicate "Bebung". Composers assumed that, like other ornaments, performers would apply "bebung" at their discretion. Where sheet music does indicate "bebung", it appears as a series of dots above or below a note, with the number of dots indicating the number of finger movements. For example: <score>
Carl Philipp Emanuel Bach called the vibrato "Bebung", however other composers like Johann Mattheson had described the term earlier on. C.P.E Bach often used Bebung in his
More often, we see the whole score with the closing tag, but no opening tag.
There similar issues with other tags (#300) and table formatting (#298).
Several elements with non-textual content such as maps and musical scores (elements
mapframe
andscore
) are not filtered out. Steps to reproduce:https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(score\|mapframe\)\b'
Output:
In general, the output also contains the content delimited by the tags (musical scores and map data). In some cases, both of the opening/closing tags (or parth the score itself) for musical scores are missing, e.g. article
id="152940"
fromdumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles3.xml-p151574p311329.bz2
contains only the opening<score>
:More often, we see the whole score with the closing tag, but no opening tag.
There similar issues with other tags (#300) and table formatting (#298).