OUTREACHY PROJECT

See OutreachyProposal for background.

NOTE: This is still a work in progress. If you spot a bug or something wrong, please let me know. Thanks!

This repository is a collection of Python modules written during an Outreachy internship with the Wikimedia Foundation, under the guidance of the mentor, Mike Peel (https://mikepeel.net).

Development is being done with Python 3.8.0 and the master branch of the Pywikibot package. Some modules require additional libraries; where that is the case, it is noted in the brief module notes below.

  1. common.py
    • This is a meta module that contains the base logic and generic functions that all the other modules can use to avoid code duplication. It facilitates converting values to the appropriate data types for Wikibase needs, as well as pushing the collected data to the data repository, Wikidata (a claim-pushing sketch appears after this list).
  2. official_website.py
    • This module extracts official website links from Wikipedia articles and adds them to the corresponding data item of each page on the repo. It uses the BeautifulSoup library (4.9.3) in addition to the standard requirements. It does not validate that the URL is actually reachable, but it does ensure that the URL is structurally valid (a URL-validation sketch appears after this list).
  3. twitter_username.py
    • This module primarily extracts the Twitter username of a subject from a Wikipedia page, or a set of pages, and then uses the username to fetch its corresponding numeric ID from Twitter. The username is exported to Wikidata as a Twitter username claim, with the numeric identifier as a numeric ID qualifier (a claim-with-qualifier sketch appears after this list). This module requires a Twitter developer API key to work fully.
  4. mb_release_group_data.py
  5. lepindex-id.py
    • This module extracts the LepIndex ID (an identifier for a Lepidoptera taxon in the UK Natural History Museum's 'Global Lepidoptera Names Index') from Wikipedia articles and stores it in the data repository. It can work with an arbitrary page or a categorized set of pages, such as the set automatically generated by this Wikipedia category.
  6. book_data.py
    • This module can be used to extract and export multiple-value statements from Wikipedia articles about books to Wikidata. Presently it can process a single page or a list of pages, and it extracts one, two, or all of the following: the OCLC number, the ISBN (both ISBN-10 and ISBN-13), and the number of pages. Basic validation is applied to each extracted value to reduce the chance of exporting invalid values (an ISBN-13 checksum sketch appears after this list).
  7. power_stations.py
    • This module extracts data from Wikipedia articles about power stations.
  8. find_a_grave-id.py
    • This module works with the Find a Grave identifier. The relevant value is likewise extracted from Wikipedia and basic validation is applied. It is then exported to the corresponding item of the wiki page as a Find a Grave memorial ID claim statement. By default, the script loops through this relevant category on the English Wikipedia (a category-iteration sketch appears after this list).
  9. theatre-venue-data.py
    • This module extracts data from Wikipedia articles about stadia, arenas, and other sporting venues, as well as theatres and cinemas.
  10. world_football_dot_net.py
  11. nft_data.py
  12. alumni_data.py
  13. game_data.py
  14. Next
  15. Next
  16. Next
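
The following sketches illustrate the recurring patterns described above. They are minimal illustrations, not the modules' actual code.

First, the claim-pushing pattern that common.py encapsulates: resolving a Wikipedia page to its Wikidata item and pushing a single claim with Pywikibot. The page title and property ID here are illustrative.

    import pywikibot

    # Connect to English Wikipedia and its data repository (Wikidata).
    site = pywikibot.Site('en', 'wikipedia')
    repo = site.data_repository()

    # Resolve a Wikipedia article to its Wikidata item.
    page = pywikibot.Page(site, 'Example article')  # illustrative title
    item = pywikibot.ItemPage.fromPage(page)

    # Build and push a claim; P856 is the 'official website' property.
    claim = pywikibot.Claim(repo, 'P856')
    claim.setTarget('https://example.org')
    item.addClaim(claim, summary='Add official website')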
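
The structural URL check described for official_website.py can be done with the standard library alone; one possible shape:

    from urllib.parse import urlparse

    def is_structurally_valid(url):
        """Accept only http(s) URLs that have a network location.

        Checks structure only; it does not verify that the URL
        actually resolves or that the site is reachable.
        """
        parts = urlparse(url)
        return parts.scheme in ('http', 'https') and bool(parts.netloc)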
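
For twitter_username.py, the claim-plus-qualifier pattern looks roughly like this: P2002 is the Twitter username property and P6552 the numeric user ID. The item and values are illustrative, so verify the property IDs before running anything like this.

    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()
    item = pywikibot.ItemPage(repo, 'Q42')  # illustrative item

    # Main claim: P2002 = Twitter username.
    claim = pywikibot.Claim(repo, 'P2002')
    claim.setTarget('example_handle')  # illustrative username
    item.addClaim(claim, summary='Add Twitter username')

    # Qualifier: P6552 = numeric user ID returned by the Twitter API.
    qualifier = pywikibot.Claim(repo, 'P6552')
    qualifier.setTarget('1234567890')  # illustrative numeric id
    claim.addQualifier(qualifier)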
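
The basic validation mentioned for book_data.py can be illustrated with the standard ISBN-13 checksum (digits weighted 1, 3, 1, 3, ..., total divisible by 10); this is the textbook algorithm, not necessarily the module's exact code:

    def is_valid_isbn13(value):
        """Return True if value passes the ISBN-13 checksum."""
        digits = value.replace('-', '').replace(' ', '')
        if len(digits) != 13 or not digits.isdigit():
            return False
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
        return total % 10 == 0

For example, is_valid_isbn13('978-0-306-40615-7') returns True, while changing any single digit makes it return False.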
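
Finally, the category loop used by find_a_grave-id.py (and similarly by lepindex-id.py) follows Pywikibot's standard generator pattern. The category name below is hypothetical; the scripts use the relevant maintenance categories on the English Wikipedia.

    import pywikibot
    from pywikibot import pagegenerators

    site = pywikibot.Site('en', 'wikipedia')
    # Hypothetical category name, for illustration only.
    cat = pywikibot.Category(site, 'Category:Example maintenance category')

    for page in pagegenerators.CategorizedPageGenerator(cat):
        print(page.title())  # each member page would be processed here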

LICENSE

The code in this repository is made available under the MIT License.