JonasGreim / US-headquarter-locations


Company Headquarters in the USA

The objective of this university project is to show how US industry has changed over time, as reflected by the major corporate headquarters included in the S&P 500 or Fortune 500 indices. The geographical distribution of these headquarters and their industry sectors is presented on a map where users can select different years and switch between the two indices.


Data

This repository is the data scraping and processing part of the project.

Visualization

The visualization/mapping of the headquarters location data can be found in a separate repository.

The final visualization is available as a website here (preview: readMeAppPreview.png).

Getting Started

To get a local copy up and running, follow these simple steps.

Installation

  1. Set up a Python virtual environment (Python version >= 3.10):
    python3 -m venv venv
    source venv/bin/activate
  2. Install the required packages:
    pip3 install -r requirements.txt

Run the scraper: scrapy (get the S&P 500 and Fortune 500 rankings)

Info: The scraper only collects the existing annual S&P 500 and Fortune 500 rankings from the specified websites; the rankings themselves do not contain any headquarters locations.

Go into the Scrapy folder:

cd companyRankingsScraper

Fortune500:

scrapy crawl us-companies-fortune500 -o fortune500.json

SP500:

scrapy crawl us-companies-sp500 
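For orientation, below is a minimal sketch of what one of these ranking spiders could look like. Only the spider name is taken from this repository; the start URL, CSS selectors, and field extraction are placeholder assumptions, not the actual logic in companyRankingsScraper.

# Hypothetical sketch of a ranking spider; URL and selectors are placeholders.
import scrapy


class Fortune500Spider(scrapy.Spider):
    name = "us-companies-fortune500"
    # Placeholder URL: the real spider targets the ranking pages of the scraped website.
    start_urls = ["https://example.com/fortune-500/2023"]

    def parse(self, response):
        # Assumes each table row holds one ranked company.
        for row in response.css("table tr"):
            yield {
                "rank": row.css("td:nth-child(1)::text").get(),
                "companyName": row.css("td:nth-child(2)::text").get(),
            }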

Official Wikipedia API (Try to get headquarters locations)

First, we tried to retrieve the headquarters locations of the companies with the official Wikipedia API.

Problem:

Run:

cd officialWikiApi
python3 officialWikiApi.py
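For reference, here is a minimal sketch of a lookup against the official Wikipedia (MediaWiki) API, assuming a plain-text intro extract is fetched per company. The action=query request with prop=extracts is a standard MediaWiki API call; the helper function is illustrative and not the actual content of officialWikiApi.py.

# Illustrative sketch only; not the repository's officialWikiApi.py.
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

def get_page_extract(company_name: str) -> str | None:
    """Fetch the plain-text intro of the company's Wikipedia page, if it exists."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": True,        # intro section only
        "explaintext": True,    # plain text instead of HTML
        "titles": company_name,
    }
    pages = requests.get(WIKI_API, params=params, timeout=10).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract")  # any headquarters info is buried in free text

print(get_page_extract("Apple Inc."))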

Wikidata API (Get headquarters locations & industry sectors)

To access the headquarters location data of the companies, we used the Wikidata API.

How the location data processing works:

  1. 1_initUniqueComaniesJson.py:

    • Creates a unique company list from the ranking data
    • With the attributes: companyName, searchQueryCompanyName, wikiDataName, qid
    • These attributes are needed because:
      • The Wikidata name search is quite inaccurate (and so are the scraped company names)
      • They let you manually compare the search name (companyName) with the retrieved name (wikiDataName)
      • If the match is wrong, you can change searchQueryCompanyName manually
      • The original scraped companyName is used later on to map the unique companies back to the ranking data
  2. 2_getAllQidsThroughCompanyNameList.py:

    • Adds the Wikidata QIDs to the unique company list (QID = Wikidata page ID)
    • Loops through the unique companies and retrieves each QID from the Wikidata API using searchQueryCompanyName
    • Search: text search against Wikidata titles and aliases only
    • API response: the first QID of the returned result array is taken (see the QID search sketch after this list)
  3. 3_getAllLocationDataThroughQIDList.py

    • Adds the headquarters location data to the unique company list
    • Fetches the Wikidata page data via the QID and extracts the headquarters location (see the entity sketch after this list)
  4. 4_getAllIndustryDataThroughQIDList.py

    • Adds the industry sector data to the unique company list
    • Fetches the Wikidata page data via the QID and extracts the industry sector
    • For our frontend, we categorized each company into one of ten industry sectors using ChatGPT.
  5. 5_mapUniqueCompaniesToRankingFortune500.py (or 5_mapUniqueCompaniesToRankingSp500.py)

    • Maps the enriched unique company data back to the ranking data
  6. 6_createGeoJsonFortune500.py (or 6_createGeoJsonSp500.py)

    • Converts the JSON data to GeoJSON format (see the GeoJSON sketch after this list)
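Sketch for step 2: resolving a QID from a company name with the Wikidata wbsearchentities action. The action and its parameters are part of the real Wikidata API; the helper name, return type, and the example output are assumptions, not the repository's actual script.

# Illustrative sketch of the QID lookup; not the repository's actual script.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_qid(search_query_company_name: str) -> tuple[str, str] | None:
    """Return (qid, wikiDataName) for the first search hit, or None if nothing matches."""
    params = {
        "action": "wbsearchentities",
        "format": "json",
        "language": "en",
        "type": "item",
        "search": search_query_company_name,
    }
    results = requests.get(WIKIDATA_API, params=params, timeout=10).json().get("search", [])
    if not results:
        return None
    first = results[0]  # only the first hit is taken, as described above
    return first["id"], first.get("label", "")

print(search_qid("Apple Inc."))  # e.g. ('Q312', 'Apple Inc.')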
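Sketch for steps 3 and 4: fetching an entity document by QID and extracting the headquarters and industry claims. P159 (headquarters location), P452 (industry), and the P625 coordinate qualifier are real Wikidata properties, but how the repository's scripts actually traverse the claims is an assumption.

# Illustrative sketch of claim extraction; the real scripts may differ.
import requests

def get_entity(qid: str) -> dict:
    """Download the full Wikidata entity document for a QID."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    return requests.get(url, timeout=10).json()["entities"][qid]

def extract_headquarters(entity: dict) -> dict | None:
    """Return the headquarters item QID and, if present, its coordinate qualifier."""
    hq_claims = entity.get("claims", {}).get("P159")
    if not hq_claims:
        return None
    statement = hq_claims[0]
    result = {"hqItemQid": statement["mainsnak"]["datavalue"]["value"]["id"]}
    coords = statement.get("qualifiers", {}).get("P625")
    if coords:
        value = coords[0]["datavalue"]["value"]
        result["lat"], result["lon"] = value["latitude"], value["longitude"]
    return result

def extract_industry_qids(entity: dict) -> list[str]:
    """Return all industry (P452) item QIDs listed on the entity."""
    return [
        claim["mainsnak"]["datavalue"]["value"]["id"]
        for claim in entity.get("claims", {}).get("P452", [])
        if claim["mainsnak"].get("datavalue")
    ]

entity = get_entity("Q312")  # Apple Inc.
print(extract_headquarters(entity), extract_industry_qids(entity))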
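Sketch for step 6: converting the merged ranking and location records into a GeoJSON FeatureCollection. The property names mirror the attributes mentioned above; the exact schema written by 6_createGeoJsonFortune500.py is an assumption.

# Illustrative sketch of the GeoJSON conversion; field names are assumptions.
import json

def to_geojson(companies: list[dict]) -> dict:
    """Build a GeoJSON FeatureCollection from records that carry lon/lat and metadata."""
    features = []
    for company in companies:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [company["lon"], company["lat"]],  # GeoJSON order is [lon, lat]
            },
            "properties": {
                "companyName": company["companyName"],
                "rank": company["rank"],
                "year": company["year"],
                "industrySector": company["industrySector"],
            },
        })
    return {"type": "FeatureCollection", "features": features}

records = [{"companyName": "Apple Inc.", "rank": 3, "year": 2022,
            "industrySector": "Technology", "lat": 37.33, "lon": -122.01}]
print(json.dumps(to_geojson(records), indent=2))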

Problems:

Solutions:

Notes

Credits