attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Option to remove blank pages? #303

Open AngledLuffa opened 1 year ago

AngledLuffa commented 1 year ago

Recent versions of the Wikipedia dumps have blank pages in them. For example, the English dump as of 2023-02-01 now starts with AccessibleComputing, which is a redirect to "Computer accessibility". This results in a blank page in the extracted Wikipedia text:

<doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">
AccessibleComputing

</doc>

Is it possible to eliminate those, perhaps as an option?
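
One possible workaround, independent of any built-in option, is to post-process the extracted files. The following is only a minimal sketch: it assumes the output uses the `<doc id=... url=... title=...>` / `</doc>` layout shown above, and that a "blank" page is one whose body is empty or contains nothing but the title line. The names `DOC_RE`, `is_blank`, and `drop_blank_docs` are hypothetical helpers, not part of wikiextractor.

```python
import html
import re

# Assumed layout: a <doc ...> header line with a title attribute,
# then the body, then a closing </doc> line.
DOC_RE = re.compile(
    r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>\n(.*?)</doc>\n?',
    re.DOTALL,
)


def is_blank(title, body):
    """Return True if the body is empty or holds only the title line."""
    lines = [html.unescape(ln.strip()) for ln in body.splitlines() if ln.strip()]
    return not lines or lines == [html.unescape(title.strip())]


def drop_blank_docs(text):
    """Remove every <doc> block that is_blank() considers blank."""
    def keep_or_drop(match):
        title, body = match.group(1), match.group(2)
        return "" if is_blank(title, body) else match.group(0)

    return DOC_RE.sub(keep_or_drop, text)
```

Applied to the contents of one extracted file, `drop_blank_docs` would return the same text without entries such as the AccessibleComputing one above. Titles containing a double quote would not match the pattern and would be left untouched.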

hxy-62 commented 6 months ago

same question