JonasGreim / US-headquarter-locations


how to get the headquarters of each company? #2

Open JonasGreim opened 1 month ago

JonasGreim commented 1 month ago

If we scrape this dataset (https://money.cnn.com/magazines/fortune/fortune500_archive/full/1955/)

How do we connect these companies to their headquarters locations? -> find a dataset where all headquarters are listed? -> generate a Wikipedia link for each company to extract the location? (Wikipedia has the geolocation data for each headquarters)

JonasGreim commented 1 month ago

Maybe with Wikipedia: -> check out the wiki API? Maybe it works without scraping.

-> General Motors? -> https://en.wikipedia.org/wiki/General_Motors -> in the infobox on the right-hand side: Headquarters: [Detroit, Michigan]

But the link is case-sensitive, and the article title could differ from the company name in our dataset.

-> could also be done with Scrapy -> for each entry -> generate the link and scrape this box -> add the company to a list of all companies whose headquarters are done/checked (no repetition)

If no wiki entry is found: search Wikipedia for the company name and take the first result.

Maybe try it first with 50 companies and then with the whole set.
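The link-generation step could be sketched roughly like this. It is only a minimal sketch: `wikipedia_url` and `candidate_urls` are hypothetical helper names, and it assumes the article title equals the company name with spaces replaced by underscores, which will not hold for every company.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wikipedia_url(company_name: str) -> str:
    """Build a candidate English Wikipedia URL from a company name (hypothetical helper)."""
    # Wikipedia titles use underscores instead of spaces; split() also
    # collapses repeated or leading/trailing whitespace.
    title = "_".join(company_name.split())
    # Percent-encode characters such as "&" that are unsafe in a URL path.
    return WIKI_BASE + quote(title)


def candidate_urls(names):
    """Yield (name, url) once per company, skipping names already handled."""
    seen = set()
    for name in names:
        if name in seen:
            continue
        seen.add(name)
        yield name, wikipedia_url(name)


for name, url in candidate_urls(["General Motors", "General Motors", "AT&T"]):
    print(name, "->", url)
```

The generated URL is only a guess; the fallback described above (search and take the first result) is still needed when the page does not exist.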

JonasGreim commented 1 month ago

I found a dataset of the current top 500 companies, but none that covers all companies of all time:

https://www.kaggle.com/datasets/sanjanapatil7/largest-companies-in-usa-by-revenue

JonasGreim commented 1 month ago

The official Wikipedia API is trash (https://www.mediawiki.org/wiki/API:Main_page).

But maybe this Python wrapper works: https://pypi.org/project/Wikipedia-API/

Or we build a proxy into Scrapy (so that we don't get blocked by the API): https://www.youtube.com/watch?v=090tLVr0l7s

JonasGreim commented 1 month ago

How many unique companies have there been over the 50 years?

top 100 -> 374 unique companies

top 50 -> 177 unique companies

top 25 -> 96 unique companies

top 10 -> 29 unique companies
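For reference, a count like this could be computed roughly as follows. This is a sketch, not the script that produced the numbers above: `count_unique_companies` is a hypothetical name, the input layout (year mapped to a rank-ordered list of names) is an assumption, and the toy data is made up.

```python
def count_unique_companies(rankings, top_n):
    """Count distinct company names that ever appeared in the top_n of any year.

    rankings: dict mapping year -> list of company names ordered by rank
    (an assumed layout for the scraped data).
    """
    unique = set()
    for ranked_names in rankings.values():
        # Only the first top_n entries of each year's list count.
        unique.update(ranked_names[:top_n])
    return len(unique)


# Made-up toy data, only to show the call.
toy_rankings = {
    1955: ["Company A", "Company B", "Company C"],
    1956: ["Company A", "Company C", "Company D"],
}

for n in (1, 2, 3):
    print(f"top {n} ->", count_unique_companies(toy_rankings, n), "unique companies")
```

Note that this treats differently spelled names of the same company as different companies, so the names would need to be normalized first.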

JonasGreim commented 1 month ago

https://github.com/barrust/mediawiki

A wiki wrapper with a geosearch function:

wikipedia.geosearch(title='washington, d.c.')

wikipedia.geosearch(latitude='0.0', longitude='0.0')

robinsonlang22 commented 1 month ago

Wikidata Query Service: use the "Wikidata Query Service" to get the names of the headquarters cities from headquartersLabel.

uscities.csv: a dataset that maps each city name to its coordinates.
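The city-to-coordinate lookup could look something like the sketch below. The column names `city`, `state_id`, `lat` and `lng` are an assumption about how uscities.csv is laid out, and `load_city_coordinates` and `lookup` are hypothetical helpers. The inline sample row uses rounded coordinates for Detroit just for illustration.

```python
import csv
import io


def load_city_coordinates(csv_file):
    """Read (city, state) -> (lat, lng) from a CSV file object.

    Assumes the columns are named city, state_id, lat and lng.
    """
    coords = {}
    for row in csv.DictReader(csv_file):
        # Key on city AND state, because city names repeat across states.
        key = (row["city"].strip().lower(), row["state_id"].strip().upper())
        coords[key] = (float(row["lat"]), float(row["lng"]))
    return coords


def lookup(coords, city, state_id):
    """Return (lat, lng) for a city, or None if it is not in the table."""
    return coords.get((city.strip().lower(), state_id.strip().upper()))


# Tiny in-memory stand-in for the real file; values are rounded.
sample = io.StringIO("city,state_id,lat,lng\nDetroit,MI,42.33,-83.05\n")
coords = load_city_coordinates(sample)
print(lookup(coords, "Detroit", "MI"))
```

With the real file, the call would be `load_city_coordinates(open("uscities.csv", newline="", encoding="utf-8"))`.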

JonasGreim commented 1 month ago

Wikidata looks interesting. Is it possible to get the headquarters? The headquarters should be available as a data tag.

How would the query look?

Is it possible to run multiple queries through an API?

robinsonlang22 commented 1 month ago

The query will look like this: SPARQL query

Yes, you can.
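The linked query is not reproduced in this thread. As a rough idea of what such a query might look like, here is a sketch that assumes the Wikidata properties P159 (headquarters location) and P625 (coordinate location) and batches several companies into one request with a `VALUES` clause. `build_headquarters_query` and `run_query` are hypothetical helpers, and the User-Agent string is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"


def build_headquarters_query(qids):
    """Build one SPARQL query that asks for the headquarters of several items."""
    # VALUES lets a single request cover many companies at once.
    values = " ".join(f"wd:{qid}" for qid in qids)
    return f"""
SELECT ?company ?companyLabel ?headquartersLabel ?coord WHERE {{
  VALUES ?company {{ {values} }}
  ?company wdt:P159 ?headquarters .
  OPTIONAL {{ ?headquarters wdt:P625 ?coord . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def run_query(query):
    """Send the query to the public endpoint and return the result rows."""
    url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    # Placeholder User-Agent; replace with something that identifies the project.
    request = Request(url, headers={"User-Agent": "hq-lookup-sketch/0.1 (example)"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


# Q445007 is the item ID that appears later in this thread.
query = build_headquarters_query(["Q445007"])
print(query)
# rows = run_query(query)  # needs network access, so it is left commented out
```

Each returned row would carry `headquartersLabel` (the city name) and, where available, `coord`, which might make the separate city-to-coordinate lookup unnecessary for some companies.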

JonasGreim commented 3 weeks ago

Maybe we could give ChatGPT all company names as a list -> it could transform the names into the official names.

But I also think there is a small problem with the tags: this entry is not found with "Altria Group": https://www.wikidata.org/wiki/Q445007
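One thing that might help with names that do not match exactly is the `wbsearchentities` action of the Wikidata API, which searches item labels and aliases. The sketch below only builds the request URL; `entity_search_url` is a hypothetical helper and the choice of parameters is an assumption about what we would need.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def entity_search_url(name, limit=5):
    """Build a wbsearchentities request URL for a company name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",  # language of the labels/aliases to search
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


print(entity_search_url("Altria Group"))
```

The JSON response lists candidate items with their IDs, which could then be fed into a headquarters query. Whether the first hit is the right company would still need checking.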

JonasGreim commented 3 weeks ago

Maybe DBpedia is better for us, because DBpedia has a full-text search and Wikidata doesn't. https://dbpedia.org/sparql

Wikidata and DBpedia are both projects that aim to make information from Wikipedia more accessible and usable by computers, but they go about it in different ways:

Data Source:
    DBpedia: Extracts data from the infoboxes (information templates) on Wikipedia articles. This process is automated.
    Wikidata: Creates a separate knowledge base where users can manually enter information about entities (people, places, things).

Focus:
    DBpedia: Focuses on generating Linked Open Data (LOD) directly from Wikipedia.
    Wikidata: Focuses on creating a central repository of structured data that can be used by Wikipedia and other projects.