attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Parsing seems to exclude some part of the page #324

Open franluca opened 8 months ago

franluca commented 8 months ago

Thanks for the great library!

I noticed that the resulting entries may miss some meaningful content, e.g.

{"id": "75159532", "revid": "39374154", "url": "https://en.wikipedia.org/wiki?curid=75159532", "title": "Tyszko", "text": "Tyszko is a surname. Notable people with the surname include: "}

is missing the list of notable people.
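To gauge how many entries are affected, a rough check along these lines might help. It is only a sketch: it assumes the `--json` output is one JSON object per line in files named `wiki_*` under the output folder, and it uses "text ends with a colon" as a heuristic for a dropped list. The helper names (`looks_truncated`, `find_truncated`, `scan_output`) are made up for illustration and are not part of wikiextractor.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def looks_truncated(text: str) -> bool:
    """Heuristic only: text ending in ':' suggests a list was dropped after it."""
    return text.rstrip().endswith(":")


def find_truncated(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed entries whose "text" field looks truncated."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if looks_truncated(entry.get("text", "")):
            yield entry


def scan_output(folder: str) -> Iterator[dict]:
    """Scan every wiki_* file under the output folder (assumed layout)."""
    for path in sorted(Path(folder).rglob("wiki_*")):
        with path.open(encoding="utf-8") as fh:
            yield from find_truncated(fh)


# In-memory check using the entry quoted above.
SAMPLE = (
    '{"id": "75159532", "revid": "39374154", '
    '"url": "https://en.wikipedia.org/wiki?curid=75159532", '
    '"title": "Tyszko", '
    '"text": "Tyszko is a surname. Notable people with the surname include: "}'
)

for entry in find_truncated([SAMPLE]):
    print(entry["title"])  # prints: Tyszko
```

Calling `scan_output("<output folder>")` on a real run would list every entry matching the heuristic. Pages that legitimately end in a colon would be flagged as well, so the count is an upper bound.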

I'm using the standard command

python -m wikiextractor.WikiExtractor <dump name> --json -o <output folder>

Am I missing something?

Thanks again, Luca