attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Tables are not entirely filtered out #298

Open adno opened 1 year ago

adno commented 1 year ago

Table markup is not fully removed: many tables (or parts of them) still appear in the extracted output.

Steps to reproduce:

  1. Download this dump: https://dumps.wikimedia.org/jawiki/20221020/jawiki-20221020-pages-articles1.xml-p1p114794.bz2
  2. Run the following command to list the output lines that contain the string "colspan":

       bzcat jawiki-20221020-pages-articles1.xml-p1p114794.bz2 | wikiextractor/WikiExtractor.py --no-templates -o - - | grep colspan

Output:

249||24||colspan="2"|-||9||0||258||24
21||1||1||0||colspan="2"|-||22||1
12||0||colspan="2"|-||colspan="2"|-||12||0
4||2||colspan="2"|-||colspan="2"|-||4||2
!rowspan="2"|通算!!colspan="2"|OFC
!colspan="4"|FIFA
!colspan="2" style="background-color:#efefef"|内容
!colspan="3"|小計
!colspan="3"|小計
!colspan="3"|小計
!colspan="3"|小計
!colspan="3"|小計
!colspan="3"|小計
!colspan="3"|小計
!colspan="3"|小計
 59 || 26 ||colspan="2"|-||colspan="2"|-|| 59 || 26
!colspan="4"|日本!!colspan="2"|リーグ戦!!colspan="2"|!!colspan="2"|天皇杯!!colspan="2"|期間通算

[shortened]
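
Until the extractor handles these cases, one possible stopgap is to post-filter its output and drop lines that still look like wikitable markup. The following is only a rough sketch of that idea, not part of WikiExtractor: the function names `looks_like_table_residue` and `strip_table_residue` are hypothetical, and the heuristic (cell attributes such as colspan/rowspan, or the `||` and `!!` cell separators) is an assumption based on the output shown above. It will also drop legitimate text that happens to contain `||` or `!!`, for example quoted source code.

```python
import re

# Heuristic patterns (assumed, based on the leaked lines in this report).
# Cell attributes such as colspan="2" or rowspan="2".
_CELL_ATTR = re.compile(r'\b(?:colspan|rowspan)\s*=\s*"?\d+"?', re.IGNORECASE)
# Wikitable cell separators: "||" between data cells, "!!" between header cells.
_CELL_SEP = re.compile(r'\|\||!!')


def looks_like_table_residue(line: str) -> bool:
    """Return True if the line resembles leftover wikitable markup."""
    stripped = line.strip()
    if not stripped:
        # Keep blank lines; they separate paragraphs in the output.
        return False
    return bool(_CELL_ATTR.search(stripped) or _CELL_SEP.search(stripped))


def strip_table_residue(lines):
    """Return the input lines with suspected table residue removed."""
    return [line for line in lines if not looks_like_table_residue(line)]


sample = [
    '249||24||colspan="2"|-||9||0||258||24',
    '!colspan="3"|小計',
    '東京は日本の首都である。',
    '',
]
cleaned = strip_table_residue(sample)
```

Applied to the extractor's output line by line, this removes every line listed above while leaving ordinary prose and blank lines in place. It does not address the underlying problem in the table handling itself.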