attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Get all revisions content #332

Open abrahami opened 1 month ago

abrahami commented 1 month ago

Hi! As far as I understand (and have confirmed by running the code), the current implementation assumes that the input dump file contains a single revision per page ID. The full-history dump files (`pages-meta-history`), however, contain every revision of each page, and when such a file is given as input, the code concatenates all revisions into one long block of text instead of splitting it by revision. Is there a simple way to "force" the code to handle the separate revisions of each page ID?
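For context, here is a minimal sketch (independent of wikiextractor's internals, with hypothetical helper names) of how the individual `<revision>` elements in a full-history dump could be streamed one at a time with `xml.etree.ElementTree`, so that each revision's wikitext could be passed to the extractor separately. The element names follow the MediaWiki XML export schema; the namespace URI varies between dump versions, so tags are matched by local name. For a `.bz2` dump, a `bz2.open(path)` file object can be passed to `iterparse` in place of the path.

```python
# Sketch only: stream (page, revision) pairs from a full-history dump.
import xml.etree.ElementTree as ET

def localname(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit('}', 1)[-1]

def iter_revisions(dump_path):
    """Yield (page_id, page_title, rev_id, timestamp, wikitext) per revision."""
    page_id = page_title = None
    for event, elem in ET.iterparse(dump_path, events=('end',)):
        tag = localname(elem.tag)
        if tag == 'title':
            page_title = elem.text
        elif tag == 'id' and page_id is None:
            # The first <id> inside <page> is the page id; revision and
            # contributor ids are skipped because page_id is already set.
            page_id = elem.text
        elif tag == 'revision':
            rev_id = timestamp = text = None
            for child in elem:
                name = localname(child.tag)
                if name == 'id':
                    rev_id = child.text
                elif name == 'timestamp':
                    timestamp = child.text
                elif name == 'text':
                    text = child.text or ''
            yield page_id, page_title, rev_id, timestamp, text
            elem.clear()  # free memory; history dumps are huge
        elif tag == 'page':
            page_id = page_title = None
            elem.clear()
```

Something along these lines could feed each revision's text to the existing per-page extraction logic instead of the whole `<page>` element at once.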

Thank you!