Hi,
When extracting a Wikipedia dump to JSON with:
python -m wikiextractor.WikiExtractor -o simpleWikipedia --templates template --json --bytes 200M simplewiki-20220901-pages-articles.xml.bz2
I would like to remove all subsection titles/headers and keep only the textual paragraphs of the corpus (e.g. remove the "The Month" and "April in poetry" titles from this page: https://simple.wikipedia.org/wiki/April).
Is there an option, or a simple change to the code, that would discard headers/titles?
Thanks!