Open miguelwon opened 1 year ago
I found this issue when analysing the output for the page Diffraction (ID: 8603). In the section "Patterns" there are three bullet points:
The angular spacing of the features... ...
These bullet points are ignored and not included in the final cleaned text. I think it is because of the asterisk.
To replicate:
I extracted the page with extractPage, then created a new file containing the single page from its output. Then I ran WikiExtractor on that file.
python -m wikiextractor.extractPage --id 8603 enwiki-latest-pages-articles-multistream.xml.bz2
python -m wikiextractor.WikiExtractor page_8603.xml --json -o teste
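To confirm whether the bullet text survived, the output can be checked with a small script. This is only a sketch: it assumes the `--json` output is one JSON object per line with `id` and `text` fields, stored in files named `wiki_*` somewhere under the output directory. The helper names `missing_snippets` and `check_output` are made up for this example and are not part of wikiextractor.

```python
import json
from pathlib import Path


def missing_snippets(text, snippets):
    """Return the snippets that do not occur anywhere in text."""
    return [s for s in snippets if s not in text]


def check_output(output_dir, page_id, snippets):
    """Find page_id in the extractor output and report missing snippets.

    Assumes one JSON object per line with "id" and "text" keys
    (an assumption about the --json output layout).
    Returns None if the page is not found at all.
    """
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                doc = json.loads(line)
                if str(doc.get("id")) == str(page_id):
                    return missing_snippets(doc.get("text", ""), snippets)
    return None
```

Calling `check_output("teste", 8603, ["The angular spacing of the features"])` after the two commands above should return a list containing that snippet if the bullet points were dropped, and an empty list if they were kept.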