attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Bullet points are missing in the final extracted text #321

Open miguelwon opened 1 year ago

miguelwon commented 1 year ago

I found this issue when analysing the extracted output for the page Diffraction (ID: 8603). In the section "Patterns" there are three bullet points:

  • The angular spacing of the features... ...

These bullet points are ignored and not included in the final cleaned text. I think it is because of the asterisk that marks each list item in the wiki markup.
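To illustrate the suspected cause, here is a minimal sketch. It is not WikiExtractor's actual code; `SAMPLE`, `LIST_MARKER`, `drop_list_lines` and `keep_list_text` are hypothetical names made up for this example. It only shows how a cleaner that discards every line starting with a list marker would lose the bullet text, and how stripping just the marker would keep it.

```python
import re

# Hypothetical wikitext fragment: "*" at the start of a line marks a bullet item.
SAMPLE = (
    "Intro sentence.\n"
    "* First bullet item.\n"
    "* Second bullet item.\n"
    "Closing sentence."
)

# Leading wiki list markers (bullets, numbered items, definitions, indents).
LIST_MARKER = re.compile(r"^[*#;:]+\s*")


def drop_list_lines(wikitext):
    """Discard every line that starts with a list marker (the suspected behaviour)."""
    return "\n".join(
        line for line in wikitext.split("\n") if not LIST_MARKER.match(line)
    )


def keep_list_text(wikitext):
    """Remove only the marker, so the text of each bullet survives."""
    return "\n".join(LIST_MARKER.sub("", line) for line in wikitext.split("\n"))


if __name__ == "__main__":
    print(drop_list_lines(SAMPLE))
    print("---")
    print(keep_list_text(SAMPLE))
```

If the real cleaning step behaves like `drop_list_lines`, that would explain why the three bullet points disappear while the surrounding paragraphs remain.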

To replicate:

I extracted the page with extractPage, saved its output to a new file containing only that page, and then ran WikiExtractor on that file.

python -m wikiextractor.extractPage --id 8603 enwiki-latest-pages-articles-multistream.xml.bz2

python -m wikiextractor.WikiExtractor page_8603.xml --json -o teste
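To confirm the result after running the two commands above, a small script along these lines could check whether the bullet text survived. This is a sketch under assumptions: that the `--json` output consists of files named `wiki_*` somewhere under the output directory, with one JSON object per line carrying `title` and `text` keys. The helper names `articles_missing_phrase` and `scan_output_dir` are made up for this example.

```python
import json
from pathlib import Path


def articles_missing_phrase(json_lines, phrase):
    """Return the titles of articles whose extracted text lacks `phrase`.

    `json_lines` is any iterable of strings, one JSON document per line
    (assumed layout of the --json output).
    """
    missing = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        if phrase not in doc.get("text", ""):
            missing.append(doc.get("title", ""))
    return missing


def scan_output_dir(out_dir, phrase):
    """Apply the check to every wiki_* file under out_dir (assumed file naming)."""
    titles = []
    for path in sorted(Path(out_dir).rglob("wiki_*")):
        with path.open(encoding="utf-8") as fh:
            titles.extend(articles_missing_phrase(fh, phrase))
    return titles


if __name__ == "__main__":
    # "teste" is the output directory used in the command above.
    print(scan_output_dir("teste", "The angular spacing of the features"))
```

If the Diffraction page shows up in the printed list, the first bullet point was dropped during extraction.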