attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

"revid" is incorrectly the page.revision.contributor.id, when it should be the page.revision.id #267

Open dnk8n opened 3 years ago

dnk8n commented 3 years ago

When parsing the following document, incorrect metadata is found.

revid should read 987507844 (link to the correct article: https://en.wikipedia.org/wiki?curid=20460173&oldid=987507844, which is current at the time of writing).

Once this is fixed, I would propose adding an option that pins links to the revid, so that the extracted text does not drift out of sync with the page the link points to.
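
A minimal sketch of what such an option could produce, assuming the correct revid is available. The function name `permalink` is hypothetical and not part of wikiextractor; the URL shape is the one used in the link above.

```python
from typing import Optional
from urllib.parse import urlencode


def permalink(curid: str, revid: Optional[str] = None) -> str:
    """Build an en.wikipedia.org URL for a page (hypothetical helper).

    When revid is given, the URL is pinned to that revision via `oldid`,
    so it keeps pointing at the text that was actually extracted.
    """
    params = {"curid": curid}
    if revid is not None:
        params["oldid"] = revid
    return "https://en.wikipedia.org/wiki?" + urlencode(params)


print(permalink("20460173"))
print(permalink("20460173", "987507844"))
```

Without a revid the helper falls back to the unpinned `curid` URL that the tool emits today.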


Erroneous result (using the --json flag, not tested without):

{'id': '20460173',
 'revid': '38890092',
 'url': 'https://en.wikipedia.org/wiki?curid=20460173',
 'title': 'Sayf al-Din Ghazi II',
 'text': "Sayf al-Din Ghazi (II) ibn Mawdud (; full name: Sayf al-Din Ghazi II ibn Mawdud ibn Zengi; died 1180) was a Zangid Emir of Mosul, the nephew of Nur ad-Din Zengi. \nHe became Emir of Mosul in 1170 after the death of his father Qutb ad-Din Mawdud. Saif had been chosen as the successor under the advice of eunuch ’Abd al-Masish, who wanted to keep the effective rule in lieu of the young emir; the disinherited son of Mawdud, Imad ad-Din Zengi II, fled to Aleppo at the court of Nur ad-Din. The latter, who was waiting for an excuse to annex Mosul, conquered Sinjar in September 1170 and besieged Mosul, which surrendered on 22 January 1171. After ousting al-Masish, he put Gümüshtekin, one of his officers, as governor, leaving Saif ud-Din nothing but the nominal title of emir. The latter also married the daughter of Nur ad-Din. \nAt Nur ad-Din's death (May 1174), Gümüshtekin went to Damascus to take control of his son and entitled himself of atabeg of Aleppo. Saif ud-Din rejected his tutorage and restored his independence. The nobles of Damascus, worried by Gümüshtekin's increasing power, offered Saif ud-Din their city, but he could not intervene since he was busy in retaking Mosul. Thenceforth Damascus was given to Saladin.\nSaladin took control of Biladu-Sham (Syria) but Saif ud-Din wanted to take over Aleppo, so he sent his brother Izz ad-Din Mas'ud at the head of an army to fight Saladin: they met in an area near Hama called Kron Hama (Arabic: قرون حماه) where Saif ud-Din was defeated. Later he prepared for another battle at Tell al-Sultan (Arabic: تل سلطان) near Aleppo, where he was also defeated; he went back to Mosul and sent messengers to Saladin offering his alliance, which was accepted.\nSaif ud-Din died from tuberculosis, and his brother Izz ad-Din Mas'ud succeeded him in 1180."}

Original XML:

  <page>
    <title>Sayf al-Din Ghazi II</title>
    <ns>0</ns>
    <id>20460173</id>
    <revision>
      <id>987507844</id>
      <parentid>987504428</parentid>
      <timestamp>2020-11-07T14:19:15Z</timestamp>
      <contributor>
        <username>Lettler</username>
        <id>38890092</id>
      </contributor>
      <comment>added [[Category:Tuberculosis deaths in Iraq]] using [[WP:HC|HotCat]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="3619" xml:space="preserve">{{Multiple issues|
{{Refimprove|date=March 2016}}
{{More footnotes|date=March 2016}}
}}
{{Infobox royalty
| type =
| title = [[Emir]] of [[Mosul]]
| name = Sayf al-Din Ghazi II
| more = 
| image = Dirham of Saif al-Din Ghazi II, 1171-1172.jpg
| caption = [[Dirham]] of Ghazi II Saif ud-Din minted in 1171/1172
| succession = 
| reign = 1170-1180
| coronation =
| predecessor = [[Qutb al-Din Mawdud]]
| successor = [[Izz ad-Din Mas'ud]]
| full name =Sayf al-Din Ghazi II ibn Qutb al-Din Mawdud ibn Imad al-Din Zengi
| house = [[Zengid Dynasty]]
| spouse=
| father = [[Qutb al-Din Mawdud]] 
| mother = 
| birth_date = 
| birth_place = 
| death_date = 1180
| death_place =
| place of burial =
| religion = [[Sunni Islam]]
}}

'''Sayf al-Din Ghazi (II) ibn Mawdud''' ({{lang-ar|سيف الدين غازي بن مودود|}}; full name: Sayf al-Din Ghazi II ibn [[Qutb al-Din Mawdud|Mawdud]] ibn [[Imad al-Din Zengi|Zengi]]; died 1180) was a [[Zangid]] [[Emir of Mosul]], the nephew of [[Nur ad-Din Zengi]]. 

He became [[Emir of Mosul]] in 1170 after the death of his father [[Qutb ad-Din Mawdud]]. Saif had been chosen as the successor under the advice of eunuch ’Abd al-Masish, who wanted to keep the effective rule in lieu of the young emir; the disinherited son of Mawdud, Imad ad-Din Zengi II, fled to [[Aleppo]] at the court of Nur ad-Din. The latter, who was waiting for an excuse to annex Mosul, conquered [[Sinjar]] in September 1170 and besieged Mosul, which surrendered on 22 January 1171. After ousting al-Masish, he put [[Gümüshtekin]], one of his officers, as governor, leaving Saif ud-Din nothing but the nominal title of emir. The latter also married the daughter of Nur ad-Din. 

At Nur ad-Din's death (May 1174), Gümüshtekin went to [[Damascus]] to take control of his son and entitled himself of atabeg of Aleppo. Saif ud-Din rejected his tutorage and restored his independence. The nobles of Damascus, worried by Gümüshtekin's increasing power, offered Saif ud-Din their city, but he could not intervene since he was busy in retaking Mosul. Thenceforth Damascus was given to [[Saladin]].

[[Saladin]] took control of Biladu-Sham ([[Syria]]) but Saif ud-Din wanted to take over [[Aleppo]], so he sent his brother [[Izz ad-Din Mas'ud]] at the head of an army to fight Saladin: they met in an area near [[Hama]] called Kron Hama (Arabic: قرون حماه) where Saif ud-Din was defeated. Later he prepared for another battle at [[Tell Sultan|Tell al-Sultan]] (Arabic: تل سلطان) near Aleppo, where he was also defeated; he went back to Mosul and sent messengers to Saladin offering his alliance, which was accepted.

Saif ud-Din died from [[tuberculosis]], and his brother [[Izz ad-Din Mas'ud]] succeeded him in 1180.&lt;ref&gt;{{cite book|first=Amin |last=Maalouf|title=The Crusades Through Arab Eyes|url=https://archive.org/details/crusadesthrougha00maal_0 |url-access=registration |year=1985 }}
&lt;/ref&gt;

== References ==
{{reflist}}

==Sources==
*{{cite book|last=Grousset|title= Histoire des croisades et du royaume franc de Jérusalem – II. 1131–1187 L'équilibre|year=1935}}

{{s-start}}
{{s-reg|}}
{{succession box|title=[[List of Emirs of Mosul|Emir of Mosul]]|before=[[Qutb al-Din Mawdud]]|years=1170–1180|after=[[Izz al-Din Mas'ud]]}}
{{s-end}}

{{DEFAULTSORT:Ghazi 02 Saif Ud-Din}}
[[Category:1180 deaths]]
[[Category:Zengid emirs of Mosul]]
[[Category:Muslims of the Crusades]]
[[Category:12th-century deaths from tuberculosis]]
[[Category:Year of birth unknown]]
[[Category:12th-century monarchs in the Middle East]]
[[Category:Tuberculosis deaths in Iraq]]
{{Authority control}}</text>
      <sha1>nug6ylcw7gnj3wtkj6iyki0tz2zt36h</sha1>
    </revision>
  </page>

dnk8n commented 3 years ago

I was about to help out with a fix until I saw that the parsing is done with regex.

Why not parse with an XML parser? Since it is XML...

Clearly the wrong group is being captured.

If regex is really required, this library could do with some simple tests that check the regex-derived values against those obtained with a real XML parser.

I am unsure whether I caught an anomaly or whether the value is consistently wrong.
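
As a rough sketch of the kind of reference check meant here, the three `<id>` values can be read with the standard library's `xml.etree.ElementTree` and compared against whatever the regex path returns. The sample is a trimmed copy of the page above; `page_ids` is a hypothetical helper name, and the sketch assumes a bare `<page>` fragment without the XML namespace that a full dump declares on its root element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <page> element from this issue (text body omitted).
SAMPLE = """<page>
  <title>Sayf al-Din Ghazi II</title>
  <ns>0</ns>
  <id>20460173</id>
  <revision>
    <id>987507844</id>
    <parentid>987504428</parentid>
    <contributor>
      <username>Lettler</username>
      <id>38890092</id>
    </contributor>
  </revision>
</page>"""


def page_ids(page_xml: str) -> dict:
    """Return the page, revision and contributor ids, addressed by path."""
    page = ET.fromstring(page_xml)
    return {
        "id": page.findtext("id"),
        "revid": page.findtext("revision/id"),
        "contributor_id": page.findtext("revision/contributor/id"),
    }


ids = page_ids(SAMPLE)
# The value reported as 'revid' in the erroneous output is the contributor id.
assert ids["revid"] == "987507844"
assert ids["contributor_id"] == "38890092"
print(ids)
```

Because each id is addressed by its full path, the contributor's `<id>` cannot be mistaken for the revision's.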

dnk8n commented 3 years ago

Funnily enough, I fell into exactly the same trap while writing my own parser for the wiki files, and I am about to work it out. I don't use regex, though; I use xml.sax, so my solution will not be applicable here.
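
For illustration only, here is one way the trap can be avoided with `xml.sax`: keep a stack of open elements so that every `<id>` is recorded under its full path. This is a hypothetical sketch, not the commenter's actual code, and the class name `PageIdHandler` is an assumption.

```python
import xml.sax

# Trimmed, ASCII-only copy of the <page> element from this issue.
SAMPLE = b"""<page>
  <id>20460173</id>
  <revision>
    <id>987507844</id>
    <contributor>
      <username>Lettler</username>
      <id>38890092</id>
    </contributor>
  </revision>
</page>"""


class PageIdHandler(xml.sax.ContentHandler):
    """Collect every <id> value keyed by the path of its enclosing elements."""

    def __init__(self):
        super().__init__()
        self.path = []   # stack of currently open element names
        self.ids = {}    # e.g. "page/revision/id" -> "987507844"
        self._buf = []   # character data of the current element

    def startElement(self, name, attrs):
        self.path.append(name)
        self._buf = []

    def characters(self, content):
        self._buf.append(content)

    def endElement(self, name):
        if name == "id":
            # Key on the whole path, not just the tag name, so the three
            # different <id> elements never overwrite one another.
            self.ids["/".join(self.path)] = "".join(self._buf).strip()
        self.path.pop()
        self._buf = []


handler = PageIdHandler()
xml.sax.parseString(SAMPLE, handler)
assert handler.ids["page/revision/id"] == "987507844"
assert handler.ids["page/revision/contributor/id"] == "38890092"
```

A handler that keys only on the tag name `id` would let the last `<id>` seen inside `<revision>` win, which in this page is the contributor's.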