attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0
3.69k stars 959 forks

Option to drop section titles/headers #293

Open Matthieu-Tinycoaching opened 1 year ago

Matthieu-Tinycoaching commented 1 year ago

Hi,

I am extracting a Wikipedia dump to JSON format with the following command:

    python -m wikiextractor.WikiExtractor -o simpleWikipedia --templates template --json --bytes 200M simplewiki-20220901-pages-articles.xml.bz2

I would like to remove all subsection titles/headers and keep only the text paragraphs of the corpus (e.g. remove the "The Month" and "April in poetry" titles from this page: https://simple.wikipedia.org/wiki/April).

Is there an option, or a simple fix in the code, to discard headers/titles?

Thanks!

Matthieu-Tinycoaching commented 1 year ago

Hi,

@attardi any idea on how to deal with these?

Thanks!
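
One possible workaround, sketched here under stated assumptions rather than as a wikiextractor feature: MediaWiki section headings are written on their own line as `== Title ==` (two to six equals signs on each side), so they can be filtered out of the raw wikitext with a regular expression before the text is cleaned. The helper name `drop_wikitext_headings` is hypothetical and is not part of wikiextractor; where to call it inside the extraction code is left open, and this thread does not confirm any built-in option that does the same thing.

```python
import re

# Matches a MediaWiki section heading line such as "== The Month ==".
# The backreference \1 requires the same run of "=" on both sides.
# Level-1 headings ("= Title =") are deliberately not matched here.
_HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def drop_wikitext_headings(wikitext: str) -> str:
    """Return the wikitext with section heading lines removed.

    Hypothetical helper for illustration; not a wikiextractor API.
    """
    kept = [line for line in wikitext.splitlines() if not _HEADING.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = (
        "April is the fourth month of the year.\n"
        "== The Month ==\n"
        "April comes between March and May.\n"
        "=== April in poetry ===\n"
        "Poets have written about April."
    )
    print(drop_wikitext_headings(sample))
```

Because the filter works on raw wikitext, it only helps if it runs before the headings are converted to plain text; lines that merely contain `==` in the middle (for example `x == y`) are left untouched.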