attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Capture page categories #130

Open hakanw opened 7 years ago

hakanw commented 7 years ago

I haven't found any way to keep the categories a page belongs to, is there one that I'm just missing?

ideabrdg commented 7 years ago

+1

MiladAlshomary commented 6 years ago

+1

urmi22 commented 6 years ago

any workaround?

ckot commented 5 years ago

I've looked at the code, but don't quite grok it yet. I think that categories are simply internal links. I see in the code where external links are filtered out, and then internal links are filtered out. Perhaps just prior to the removal of internal links, the category links could be extracted and added to the output, either as

<doc .....>
  <categories>
     <category>Foo</category>
      <category>Bar</category>
   </categories>
</doc>

or for JSON output as field

  categories: ['Foo', 'Bar']

or perhaps preferably as a separate output file which indexes categories to doc ids, in XML or JSON to stay consistent with the format of the rest of the output. Whether to include the categories in the main output or in a separate file could be a command-line option.
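A rough sketch of what such a separate index could look like, assuming the categories have already been captured per document. The `docs` list and its field names are made up for illustration and are not wikiextractor output.

```python
import json
from collections import defaultdict

# Hypothetical documents that already carry a "categories" field.
docs = [
    {"id": "1", "categories": ["Foo", "Bar"]},
    {"id": "2", "categories": ["Bar"]},
]

# Inverted index: category name -> ids of the documents in that category.
index = defaultdict(list)
for doc in docs:
    for category in doc["categories"]:
        index[category].append(doc["id"])

# Serialized form that could be written to the separate output file.
index_json = json.dumps(index, sort_keys=True)
print(index_json)
```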

Before I'd add such a feature, I just need to verify (does anyone know) if categories are indeed just a subset of 'internal links'.

ckot commented 5 years ago

I see an existing, unapplied PR which adds a 'filter by category' feature, and see it simply adds a regexp to capture the categories. I'll try that approach to capture categories and if successful will post a PR
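A minimal sketch of the regexp approach, not the code from that PR. The pattern and the function name `extract_categories` are assumptions made for illustration; the idea is to run it on the raw wikitext before internal links are stripped.

```python
import json
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]]; captures only the name.
# A leading colon ([[:Category:Name]]) is a plain link, not a membership, and is skipped.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def extract_categories(wikitext):
    """Return category names in order of appearance, without duplicates."""
    names = []
    for name in CATEGORY_LINK.findall(wikitext):
        if name not in names:
            names.append(name)
    return names


sample = "Text.\n[[Category:Foo]]\n[[Category:Bar|Sort key]]\n[[Some article]]"
doc = {"id": "1", "title": "Example", "categories": extract_categories(sample)}
print(json.dumps(doc))
```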

dvirginz commented 4 years ago

Any solution?

ckot commented 4 years ago

@dvirginz I ended up using chartbeat-labs/textacy instead as they were quicker to respond to pull requests, although wiki download extraction is only a tiny portion of what that lib does.

dvirginz commented 4 years ago

@ckot ! thanks a lot for your answer on a post from 2018, I thought there is no chance I'll get an answer. Have you found (or know a way) of getting a top-level category? As Obama for example is marked as "Politicians from Chicago" where there is also a "politicians" category, which is much more beneficial for me.

ckot commented 4 years ago

Oh god, this brings me back. I don't know what the wiki author guidelines are, but if the Obama article's authors were following them, it would seem that they associate the most specific (lowest-level) categories with an article, and Wikipedia's category hierarchy lets a user "drill up" from there. For example, if you scroll down to the categories at the bottom of the page and click on one, then scroll to the bottom of that category's page and click on one of its categories, and so on, you will eventually end up at 'People by Occupation' or something like that.

I believe the main articles have a namespace (the <ns> field in the dump) of 0 and the category pages have 14. To process one or the other, you'll need to filter by that field.
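A small sketch of filtering by that field, assuming a simplified sample. The sample omits the `xmlns` attribute that real dumps carry on `<mediawiki>`, so the tag names here are plain; the titles and the function name are made up.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; real dumps declare an XML namespace.
SAMPLE = """<mediawiki>
  <page><title>Barack Obama</title><ns>0</ns><id>1</id></page>
  <page><title>Category:Politicians</title><ns>14</ns><id>2</id></page>
  <page><title>Talk:Barack Obama</title><ns>1</ns><id>3</id></page>
</mediawiki>"""

ARTICLE_NS = "0"    # main articles
CATEGORY_NS = "14"  # category pages


def titles_in_namespace(xml_text, ns):
    """Return the titles of all pages whose <ns> field equals ns."""
    root = ET.fromstring(xml_text)
    return [
        page.findtext("title")
        for page in root.iter("page")
        if page.findtext("ns") == ns
    ]


print(titles_in_namespace(SAMPLE, ARTICLE_NS))
print(titles_in_namespace(SAMPLE, CATEGORY_NS))
```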

What I did was first process the category namespace so that I could associate categories with their subcategories. Being a python programmer, I used networkX to model the network hierarchy as a graph. Note: there WILL be cycles. It does not result in a DAG.

I created a root node with the subcategories of: https://en.wikipedia.org/wiki/Category:Main_topic_classifications

I then processed the article namespace, associating their [[Category:*]] links with the nodes in my graph.
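The two passes could be sketched roughly like this. It uses plain dictionaries as a dependency-free stand-in for networkX, and the `pages` list is hypothetical, already-parsed input rather than real dump data.

```python
import re
from collections import defaultdict

CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]")

# Hypothetical, already-parsed pages: (namespace, title, wikitext).
pages = [
    (14, "Category:People", "[[Category:Main topic classifications]]"),
    (14, "Category:People by occupation", "[[Category:People]]"),
    (14, "Category:Politicians", "[[Category:People by occupation]]"),
    (0, "Some politician", "Biography text. [[Category:Politicians]]"),
]

parents = defaultdict(set)             # category -> its parent categories
article_categories = defaultdict(set)  # article title -> categories it links to

# Pass 1: category namespace (14) gives the subcategory -> parent edges.
for ns, title, text in pages:
    if ns == 14:
        name = title[len("Category:"):]
        parents[name] |= set(CATEGORY_LINK.findall(text))

# Pass 2: article namespace (0) attaches articles to category nodes.
for ns, title, text in pages:
    if ns == 0:
        article_categories[title] |= set(CATEGORY_LINK.findall(text))
```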

What I was playing around with was selecting an arbitrary depth of category specificity I cared about, say 3 or 4. Then, starting with the deepest categories in the hierarchy, I promoted them up one node at a time until everything was at the max depth I had decided upon. As I did this, I updated my indices of which category ids the articles were associated with.
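One possible reading of that promotion step, sketched with the standard library only. The hierarchy, the names `depths_from_root` and `promote`, and the choice of "depth" as shortest distance from the root are all assumptions, not a reconstruction of the original code.

```python
from collections import deque

# Hypothetical child -> parents mapping, including one cycle.
parents = {
    "People": {"Root"},
    "People by occupation": {"People"},
    "Politicians": {"People by occupation", "Politicians from Chicago"},
    "Politicians from Chicago": {"Politicians"},
}


def depths_from_root(parents, root):
    """Breadth-first search downwards; depth is the shortest distance from root."""
    children = {}
    for child, its_parents in parents.items():
        for parent in its_parents:
            children.setdefault(parent, set()).add(child)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in sorted(children.get(node, ())):
            if child not in depth:  # first visit is the shortest; also breaks cycles
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def promote(category, parents, depth, max_depth):
    """Walk up one level at a time until the category is at max_depth or shallower.

    Assumes the category is reachable from the root (otherwise KeyError).
    """
    while depth[category] > max_depth:
        # Pick a parent that is exactly one level closer to the root.
        category = min(
            p for p in parents[category] if depth.get(p) == depth[category] - 1
        )
    return category


depth = depths_from_root(parents, "Root")
print(promote("Politicians from Chicago", parents, depth, 2))
```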

From there, for any article's set of concepts, I would make use of some networkX function to find the shortest path (avoiding cycles) between each of its concept nodes and the root, and I would end up with a manageable(?) set of paths I could work with.

For example, you might end up with:

/people/people_by_occupation/politicians
/people/people_by_nationality/american
…
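A sketch of how such a path might be produced, using a plain breadth-first search as a stand-in for a networkX shortest-path call. The graph, the root name "Root" and the helper names are hypothetical.

```python
from collections import deque

# Hypothetical child -> parents mapping, including one cycle.
parents = {
    "Politicians from Chicago": {"Politicians"},
    "Politicians": {"People by occupation", "Politicians from Chicago"},
    "People by occupation": {"People"},
    "People": {"Root"},
}


def path_to_root(category, parents, root):
    """Search upwards breadth-first; return the root-to-category path, or None."""
    came_from = {category: None}  # parent -> the child we reached it from
    queue = deque([category])
    while queue:
        node = queue.popleft()
        if node == root:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path  # ordered root ... category
        for parent in sorted(parents.get(node, ())):
            if parent not in came_from:  # visited check keeps cycles harmless
                came_from[parent] = node
                queue.append(parent)
    return None


def as_slug(path):
    """Render a path (without the root) in the /a/b/c style shown above."""
    return "/" + "/".join(name.lower().replace(" ", "_") for name in path[1:])


print(as_slug(path_to_root("Politicians", parents, "Root")))
```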

This ended up being quite difficult though, and I’m not sure if the reason it sounds easy is due to my wrestling with it for quite a while, or if I’m forgetting about some complexities involved. Also, I never finished my project, so take this all with a grain of salt, I don’t want to send you down a rabbit hole. I believe my problem was due to the fact that I was trying to use the categories as features for machine learning and felt that I still had too many categories, rather than my not being able to produce reliable top-level categories for an article.

Good luck!

-Scott


dvirginz commented 4 years ago

@ckot That's awesome, thanks for sharing your input. I decided to take a different route and create the labels from a different dataset. For example, taking the MovieLens dataset for the movie_name<->categories info and matching it to the relevant article. Movies have the same issue as politicians (the Wikipedia category of "The Godfather" is "1970s crime films" and not simply "crime films"), but with movie datasets the situation is a bit easier (for better or worse).

Again, thanks for the thorough answer

dvirginz commented 4 years ago

processed the article namespace, associating their [[Category:*]] links with the nodes in my graph. What I was playing around with was selecting an ar

Sorry for bothering you again. Just to make sure I got you right: you never succeeded in extracting the "parent categories", right? I.e. getting from "politicians" to "people". I understand it's not a DAG, nor a single-source graph, but I thought maybe their API offers that option.

ckot commented 4 years ago

I'm afraid I can't be of any help with that. I'm familiar with processing the giant XML dump files to access the text of the entire Wikipedia site, not with using their API to access data for individual pages.


dnk8n commented 3 years ago

I'm contemplating using some of the parsing tips here: https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c, to simply record ids that match the categories I care about.

Then use wikiextractor (as a subprocess) and only save if ID is found in the list above.

Probably quite dumb. But I think it would work for my purposes.
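A sketch of the "only save if the ID is in the list" step, applied to text that wikiextractor has already produced. The `<doc ...>` shape follows the output shown further down in this thread; the second sample doc and the function name are made up, and how the subprocess is launched is left out on purpose.

```python
import re

# One extracted document: <doc id="..." ...> ... </doc>
DOC_BLOCK = re.compile(r'<doc id="(\d+)"[^>]*>.*?</doc>', re.DOTALL)


def keep_wanted_docs(extracted_text, wanted_ids):
    """Return the <doc> blocks whose id is in wanted_ids (a set of strings)."""
    return [
        match.group(0)
        for match in DOC_BLOCK.finditer(extracted_text)
        if match.group(1) in wanted_ids
    ]


sample_output = (
    '<doc id="20460533" url="?curid=20460533" title="William Perrin (bishop)">\n'
    "William Perrin (bishop)\n\nWilliam Willcox Perrin was an Anglican bishop.\n</doc>\n"
    '<doc id="12" url="?curid=12" title="Other">\nOther\n\nUnrelated text.\n</doc>\n'
)

kept = keep_wanted_docs(sample_output, {"20460533"})
print(len(kept))
```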


Another way is probably to get MediaWiki running locally (but that is much more work for maybe not too much gain?), request the page from the local server using the "url", and read the categories from a script tag (e.g. use developer tools on a wiki page with categories and search for 'wgCategories').

Mediawiki method e.g. use ID revealed by wikiextractor to parse https://en.wikipedia.org/?curid=20460533 (but sub in the base URL of your local mediawiki server)
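A sketch of pulling `wgCategories` out of a fetched page, assuming the array appears in the script tag as valid JSON. The surrounding `RLCONF=...` shape in the sample is an assumption about the page source, fetching the HTML is not shown, and a category name containing `]` would break this simple pattern.

```python
import json
import re

# Grab the array that follows the "wgCategories" key.
WG_CATEGORIES = re.compile(r'"wgCategories"\s*:\s*(\[.*?\])', re.DOTALL)


def categories_from_html(html):
    """Return the wgCategories list found in the page source, or [] if absent."""
    match = WG_CATEGORIES.search(html)
    return json.loads(match.group(1)) if match else []


# Hypothetical, shortened page source.
sample_html = (
    '<script>RLCONF={"wgPageName":"William_Perrin_(bishop)",'
    '"wgCategories":["1848 births","1934 deaths","Bishops of Willesden"],'
    '"wgArticleId":20460533};</script>'
)

print(categories_from_html(sample_html))
```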

Notice some extra odd looking categories, compared to other method (they look pretty meta, can they safely be disregarded?):

'CS1: Julian–Gregorian uncertainty'
'Use British English from February 2012'
'Use dmy dates from February 2021'
'AC with 0 elements'
'All stub articles'
'Church of England bishop stubs'
'Canadian bishop stubs'

This is what wgCategories holds:

{'wgCategories': ['CS1: Julian–Gregorian uncertainty',
                  'Use British English from February 2012',
                  'Use dmy dates from February 2021',
                  'AC with 0 elements',
                  'All stub articles',
                  '1848 births',
                  '1934 deaths',
                  'Anglican bishops of British Columbia',
                  'Bishops of Willesden',
                  'Alumni of Trinity College, Oxford',
                  '19th-century Anglican Church of Canada bishops',
                  '20th-century Church of England bishops',
                  'Burials at St John-at-Hampstead',
                  'Freemasons of the United Grand Lodge of England',
                  '20th-century Anglican Church of Canada bishops',
                  'Church of England bishop stubs',
                  'Canadian bishop stubs']}

E.g. if I were interested in the "Bishops" category, this page would have matched.
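A heuristic sketch for setting the meta-looking categories aside before matching. The patterns are guessed purely from the examples above and are an assumption, not a rule Wikipedia defines; the function names are made up.

```python
import re

# Guessed patterns for maintenance-style categories (assumption based on the list above).
MAINTENANCE = re.compile(
    r"^(CS1\b|Use .+ from |AC with |All stub articles$)|\bstubs$"
)


def content_categories(categories):
    """Drop categories that look like maintenance or stub bookkeeping."""
    return [c for c in categories if not MAINTENANCE.search(c)]


def matches_topic(categories, keyword):
    """True if any remaining category contains the keyword, ignoring case."""
    return any(keyword.lower() in c.lower() for c in content_categories(categories))


wg_categories = [
    "CS1: Julian–Gregorian uncertainty",
    "Use British English from February 2012",
    "Use dmy dates from February 2021",
    "AC with 0 elements",
    "All stub articles",
    "1848 births",
    "Bishops of Willesden",
    "Church of England bishop stubs",
    "Canadian bishop stubs",
]

print(content_categories(wg_categories))
print(matches_topic(wg_categories, "Bishops"))
```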

For subcategories you could probably parse and expand the main category pages you are after and go a few layers deep (experimentation probably required)
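Going "a few layers deep" could look roughly like this depth-limited expansion. The `subcategories` mapping is hypothetical; in practice it would have to be built from the category pages first.

```python
from collections import deque

# Hypothetical category -> direct subcategories mapping, with a deliberate cycle.
subcategories = {
    "Bishops": {"Anglican bishops", "Catholic bishops"},
    "Anglican bishops": {"Bishops of Willesden", "Bishops"},
    "Bishops of Willesden": {"Deep subcategory"},
}


def expand(category, subcategories, max_depth):
    """Return the category plus all subcategories at most max_depth levels below it."""
    seen = {category}
    queue = deque([(category, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend any further from this node
        for child in subcategories.get(node, ()):
            if child not in seen:  # guards against cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


print(sorted(expand("Bishops", subcategories, 2)))
```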

Similarly, the simpler parsing method (see the following link)

The corresponding XML page as above, from the latest dump, looks like this:

  <page>
    <title>William Perrin (bishop)</title>
    <ns>0</ns>
    <id>20460533</id>
    <revision>
      <id>1018042021</id>
      <parentid>1012358087</parentid>
      <timestamp>2021-04-16T00:16:58Z</timestamp>
      <contributor>
        <username>Citation bot</username>
        <id>7903804</id>
      </contributor>
      <comment>Alter: title. Removed parameters. | [[WP:UCB|Use this bot]]. [[WP:DBUG|Report bugs]]. | Suggested by Abductive | [[Category:1848 births]] | via #UCB_Category 904/2182</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="6361" xml:space="preserve">{{Use British English|date=February 2012}}
{{Use dmy dates|date=February 2021}}
{{Infobox Christian leader
| name             = William Perrin
| title            = [[Bishop of Willesden]]
| image            = 
| image_size       = 
| alt              = 
| caption          = 
| diocese          = [[Diocese of London]]
| elected          = 
| term             = 1911–1929 (ret.)
| enthroned        = 
| quashed          = 
| term_end         = 
| predecessor      = 
| opposed          = 
| successor        = [[Guy Smith (bishop)|Guy Smith]]
| other_post       = 
{{unbulleted list|[[Anglican Diocese of British Columbia|Bishop of British&amp;nbsp;Columbia]] {{nowrap|(1893–1911)}}|Rector of [[St&amp;nbsp;Andrew Undershaft]] {{nowrap|(1912–1934)}}|[[Assistant&amp;nbsp;Bishop of London]] {{nowrap|(1929–1934)}}}}
&lt;!---------- Orders ----------&gt;
| ordination       = 1870
| ordained_by      = 
| consecration     = 1893
| consecrated_by   = [[Edward White&amp;nbsp;Benson]] (Canterbury)
&lt;!---------- Personal details ----------&gt;
| birth_name       = 
| birth_date       = {{birth date|1848|8|11|df=y}}
| birth_place      = [[Westbury-on-Trym]], [[Somerset]],&lt;!--as was--&gt; UK
| death_date       = {{death date and age|1934|6|27|1848|8|11|df=y}}
| death_place      = 
| buried           = 
| nationality      = 
| religion         = [[Anglicanism|Anglican]]
| residence        = 
| parents          = 
| spouse           = 
| children         = 
| occupation       = 
| profession       = 
| education        = 
| alma_mater       = [[King's College London]]
}}
'''William Willcox Perrin''' (11 August 1848{{snd}}27 June 1934) was an [[Anglican]] bishop in the late 19th and early 20th centuries.

Perrin was born at [[Westbury-on-Trym|Westbury-on-Trym, Somersetshire]], on 11 August 1848 and educated at both [[King's College London]] and [[Trinity College, Oxford]].&lt;ref name=&quot;CalHer1911&quot; /&gt;&lt;ref name=&quot;Who's Who&quot; /&gt; Ordained in 1870, he began his ministry with a [[Curate|curacy]] at St&amp;nbsp;Mary's [[Southampton]] and was then [[vicar]] of St&amp;nbsp;Luke's in the same city before his ordination to the [[episcopate]] as the [[Anglican Diocese of British Columbia|Bishop of British Columbia]].&lt;ref&gt;{{cite book | last=[[Richard Malden|Malden Richard (ed)]] | author-link= | title= Crockford's Clerical Directory for 1920 (51st edn) | location= London | publisher= The Field Press| pages=1630| year=1920 | isbn=}}&lt;/ref&gt; He was consecrated a bishop on 24 March 1893, by [[Edward White&amp;nbsp;Benson]], [[Archbishop of Canterbury]], at [[Westminster Abbey]].&lt;ref&gt;{{Church Times | title = Consecration of bishops | archive = 1893_03_30_347 | issue = 1575 | date = 30 March 1893 | page = 347 | accessed = 15 March 2021 }}&lt;/ref&gt; He was later [[Translation (ecclesiastical)|translated]] to be the [[Bishop of Willesden]]. During this period he was also the [[Rector (ecclesiastical)|rector]] of [[St&amp;nbsp;Andrew Undershaft]]&lt;ref name=&quot;Who's Who&quot; /&gt; A noted [[Freemason]]&lt;ref name=&quot;Anonymous2003&quot; /&gt; (he kept the rectory until his death).&lt;ref name=&quot;mem&quot; /&gt; He died on 27 June 1934&lt;ref name=&quot;TheTimes&quot; /&gt; and is buried in the churchyard of [[St&amp;nbsp;John-at-Hampstead Church]], London. His sister Edith was a prominent social reformer.&lt;ref name=&quot;Hale1994&quot; /&gt;

Perrin unveiled and dedicated the [[Hampstead War Memorial]] in May 1922.&lt;ref name=&quot;NHLE&quot; /&gt;

He retired in summer 1929,&lt;ref&gt;{{Church Times | title = Bishop of Willesden | archive = 1929_03_01_245 | issue = 3449 | date = 1 March 1929 | page = 245 | accessed = 24 September 2020 }}&lt;/ref&gt; resigning his See in time for his successor's consecration on the [[James the Great|Feast of St&amp;nbsp;James]] (25&amp;nbsp;July).&lt;ref&gt;{{Church Times | title = New Bishop of Willesden | archive = 1929_07_26_108 | issue = 3470 | date = 26 July 1929 | page = 108 | accessed = 24 September 2020 }}&lt;/ref&gt; He became an [[Assistant&amp;nbsp;Bishop of London]] until his death&lt;ref name=&quot;mem&quot;&gt;{{Church Times | title = in memoriam: Bishop Perrin | archive = 1934_06_29_794 | issue = 3727 | date = 29 June 1934 | page = 794 | accessed = 24 September 2020 }}&lt;/ref&gt; — he apparently retained oversight of Hampstead deanery throughout.&lt;ref&gt;{{Church Times | title = Church news | archive = 1934_08_10_137 | issue = 3733 | date = 10 August 1934 | page = 137 | accessed = 24 September 2020 }}&lt;/ref&gt;

==References==
{{reflist|2|refs=

&lt;ref name=&quot;CalHer1911&quot;&gt;{{cite news |title=British Columbia To Lose Noted Bishop |work=The Calgary Herald |agency=Canadian Associated Press |date=1911-08-09 |page=11}}&lt;/ref&gt;

&lt;ref name=&quot;Anonymous2003&quot;&gt;{{cite book|author=Anonymous |title=Representative British Freemasons|url=https://books.google.com/books?id=6DtvwH_Dfl8C&amp;pg=PA109|date=January 2003|publisher=Publishing|isbn=978-0-7661-3589-5|pages=109–}}&lt;/ref&gt;

&lt;ref name=&quot;Hale1994&quot;&gt;{{citation|url= http://www.biographi.ca/en/bio/perrin_edith_13E.html|first=Linda L.|last= Hale|title=PERRIN, EDITH|work=Dictionary of Canadian Biography|volume= 13|publisher=University of Toronto/Université Laval|date=1994|access-date=21 October 2019}}&lt;/ref&gt;

&lt;ref name=&quot;Who's Who&quot;&gt;{{Who's Who|id=215417|surname=Perrin|othernames=William Willcox}}&lt;/ref&gt;

&lt;ref name=&quot;TheTimes&quot;&gt;{{Cite newspaper The Times|date= 28 June 1934|p= 19|issue =46792|column= A |title=Obituary- Bishop Perrin, Columbia And Willesden}}&lt;/ref&gt;

&lt;ref name=&quot;NHLE&quot;&gt;{{NHLE|num=1423688|desc=Hampstead War Memorial|access-date=27 June 2017|mode=cs2}}&lt;/ref&gt;

}}

{{s-start}}
{{s-rel|en}}
{{s-bef|before=[[George Hills]]}}
{{s-ttl|title=[[Anglican Diocese of British Columbia|Bishop of British Columbia]]|years=1893–1911}}
{{s-aft|after=[[Charles Roper]]}}
{{s-new}}
{{s-ttl|title=[[Bishop of Willesden]]|years=1911–1934}}
{{s-aft|after=[[Guy Smith (bishop)|Guy Smith]]}}
{{s-end}}
{{Anglican Bishops of British Columbia}}
{{Bishops of Willesden}}
{{authority control}}
{{DEFAULTSORT:Perrin, William Willcox}}
[[Category:1848 births]]
[[Category:1934 deaths]]
[[Category:Anglican bishops of British Columbia]]
[[Category:Bishops of Willesden]]
[[Category:Alumni of Trinity College, Oxford]]
[[Category:19th-century Anglican Church of Canada bishops]]
[[Category:20th-century Church of England bishops]]
[[Category:Burials at St John-at-Hampstead]]
[[Category:Freemasons of the United Grand Lodge of England]]
[[Category:20th-century Anglican Church of Canada bishops]]
{{ChurchofEngland-bishop-stub}}
{{Canada-bishop-stub}}</text>
      <sha1>o0c7vb4k0dpi0dfu8ntwzs7yzgkvk8d</sha1>
    </revision>
  </page>

You could parse the [[Category:.*]] snippets to gain the categories and append the id to list if there is a match. I am not sure how long it would take to go through the entire wikipedia locally.
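A sketch of that idea using a streaming parse, so the whole dump never has to sit in memory. The sample is a cut-down stand-in for the page above, the `xmlns` value is an assumption about the dump version, and `matching_ids` is a made-up name.

```python
import io
import re
import xml.etree.ElementTree as ET

CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]")

# Cut-down, hypothetical sample of a dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>William Perrin (bishop)</title><ns>0</ns><id>20460533</id>
    <revision><id>1018042021</id>
      <text>Text. [[Category:1848 births]] [[Category:Bishops of Willesden]]</text>
    </revision>
  </page>
  <page>
    <title>Other</title><ns>0</ns><id>12</id>
    <revision><id>5</id><text>[[Category:1934 deaths]]</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def matching_ids(xml_stream, keyword):
    """Return ids of articles (ns 0) with a category containing keyword."""
    keyword = keyword.lower()
    ids = []
    for _, page in ET.iterparse(xml_stream, events=("end",)):
        if local(page.tag) != "page":
            continue
        # Direct children only, so the page <id> is not confused with the revision <id>.
        direct = {local(child.tag): (child.text or "") for child in page}
        text = "".join(n.text or "" for n in page.iter() if local(n.tag) == "text")
        categories = CATEGORY_LINK.findall(text)
        if direct.get("ns") == "0" and any(keyword in c.lower() for c in categories):
            ids.append(direct.get("id"))
        page.clear()  # release the page's content once it has been handled
    return ids


print(matching_ids(io.BytesIO(SAMPLE), "bishops"))
```

For a compressed dump, the same function should accept the file object returned by `bz2.open(path, "rb")` in place of the `io.BytesIO` wrapper.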

I will try link to a Jupyter Notebook with this working. Please let me know if my thinking is flawed.

P.S. I don't have the patience to set up a local MediaWiki server, so I will only be doing the simple method. Even if it takes longer to compute, I am sure it will save on dev time!

P.P.S For completeness, this is what wikiextractor parses, great job... super excited to find this library!

<doc id="20460533" url="?curid=20460533" title="William Perrin (bishop)">
William Perrin (bishop)

William Willcox Perrin (11 August 184827 June 1934) was an Anglican bishop in the late 19th and early 20th centuries.
Perrin was born at Westbury-on-Trym, Somersetshire, on 11 August 1848 and educated at both King's College London and Trinity College, Oxford. Ordained in 1870, he began his ministry with a curacy at St Mary's Southampton and was then vicar of St Luke's in the same city before his ordination to the episcopate as the Bishop of British Columbia. He was consecrated a bishop on 24 March 1893, by Edward White Benson, Archbishop of Canterbury, at Westminster Abbey. He was later translated to be the Bishop of Willesden. During this period he was also the rector of St Andrew Undershaft A noted Freemason (he kept the rectory until his death). He died on 27 June 1934 and is buried in the churchyard of St John-at-Hampstead Church, London. His sister Edith was a prominent social reformer.
Perrin unveiled and dedicated the Hampstead War Memorial in May 1922.
He retired in summer 1929, resigning his See in time for his successor's consecration on the Feast of St James (25 July). He became an Assistant Bishop of London until his death — he apparently retained oversight of Hampstead deanery throughout.

</doc>

dnk8n commented 3 years ago

I will try link to a Jupyter Notebook with this working. Please let me know if my thinking is flawed.

Edited it to include output. You can see it adds roughly 50% of the time that a full run with template processing takes. But I think if the category processing were integrated into the wikiextractor project, it would not add much time at all.

You can download the output on Kaggle here

Here is the notebook that created the data, with output. It works as I imagined, however this is very rough.

I hope it serves as a proof of concept that category and correct revid information can be captured.

Requires the installation of bzcat and of course wikiextractor (which is run as a subprocess). Please contact me with questions. You can find me @dnk8n on twitter/IG for example.

dnk8n commented 3 years ago

In case it is useful, to easily install the wiki dumps that worked with the above code, see https://gist.github.com/dnk8n/afcd8585865fa29abe625e8ecee94c68