JonasGreim closed this issue 1 month ago
Maybe with Wikipedia -> check out the wiki API? Maybe it works without scraping.
-> General Motors? -> https://en.wikipedia.org/wiki/General_Motors -> right-hand side in the fact box: Headquarters: [Detroit, Michigan]
But the link is case-sensitive and could be spelled differently.
-> Could also be done with Scrapy -> for each entry -> generate the link and scrape this box -> add the company to a list of all companies whose headquarters are done/checked (no repetition).
If no wiki entry is found: search the wiki for the company name and click on the first result.
Maybe try it first with 50 companies and then with the whole set.
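The link-generation step above could be sketched roughly as follows. This is only a minimal sketch: the helper names `wiki_url` and `pending_urls` are made up for illustration, and it assumes the article title is simply the company name with spaces replaced by underscores, which will not hold for every company.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(company_name: str) -> str:
    """Guess the English Wikipedia article URL for a company name.

    Assumption: the article title equals the company name with spaces
    replaced by underscores. Titles are case-sensitive after the first
    letter, so this guess can miss.
    """
    title = company_name.strip().replace(" ", "_")
    return WIKI_BASE + quote(title)


def pending_urls(company_names, done=None):
    """Yield (name, url) once per company, skipping names already checked.

    `done` is a set of normalized names that were handled before; it is
    updated in place so the same company is never scraped twice.
    """
    done = set() if done is None else done
    for name in company_names:
        key = name.strip().casefold()
        if key in done:
            continue
        done.add(key)
        yield name, wiki_url(name)
```

For a first trial run, something like `pending_urls(companies[:50])` would limit the crawl to 50 companies before moving on to the whole set.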
Dataset of the current top 500 companies (link below). But I found none with all companies of all time.
https://www.kaggle.com/datasets/sanjanapatil7/largest-companies-in-usa-by-revenue
The official Wikipedia API is trash (https://www.mediawiki.org/wiki/API:Main_page).
But maybe this Python wrapper: https://pypi.org/project/Wikipedia-API/
Or we build in a proxy with Scrapy (so that we don't get blocked by the API): https://www.youtube.com/watch?v=090tLVr0l7s
How many unique companies have there been in the 50 years?
top 100 -> 374 unique companies
top 50 -> 177 unique companies
top 25 -> 96 unique companies
top 10 -> 29 unique companies
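A count like this could be computed along these lines. It is a sketch only: the input shape (a dict mapping each year to its list of company names ordered by rank) and the name `unique_companies` are assumptions, and the sample data is a toy placeholder, not the real rankings.

```python
def unique_companies(rankings, top_n):
    """Return the set of companies that ever appeared in the top `top_n`.

    `rankings` is assumed to map a year to a list of company names
    ordered by rank (best first).
    """
    seen = set()
    for ranked_names in rankings.values():
        seen.update(ranked_names[:top_n])
    return seen


# Toy placeholder data, only to show the call; not the real rankings.
sample = {
    1955: ["A", "B", "C"],
    1956: ["B", "A", "D"],
    1957: ["A", "E", "B"],
}
for n in (3, 2, 1):
    print(f"top {n} -> {len(unique_companies(sample, n))} unique companies")
```

Note that this counts distinct name strings, so a company that was renamed over the years would be counted more than once unless the names are normalized first.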
https://github.com/barrust/mediawiki
Wiki wrapper with a geosearch function:
wikipedia.geosearch(title='washington, d.c.')
wikipedia.geosearch(latitude='0.0', longitude='0.0')
Wikidata Query Service
Using the "Wikidata Query Service" to get the names of the headquarters cities from headquartersLabel.
uscities.csv
There is a dataset that maps each city name to its coordinates.
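Mapping a headquartersLabel to coordinates via uscities.csv might look like the sketch below. The column names `city`, `lat` and `lng` are an assumption about that file's layout, the helper names are made up, and the sample row uses rough illustrative coordinates.

```python
import csv
import io


def load_city_coords(csv_file):
    """Build a lookup from lower-cased city name to (lat, lng).

    Assumes the CSV has `city`, `lat` and `lng` columns. City names are
    not unique across states; this sketch keeps the first row it sees.
    """
    coords = {}
    for row in csv.DictReader(csv_file):
        key = row["city"].strip().casefold()
        coords.setdefault(key, (float(row["lat"]), float(row["lng"])))
    return coords


def locate(headquarters_label, coords):
    """Return (lat, lng) for a city label, or None if it is unknown."""
    return coords.get(headquarters_label.strip().casefold())


# Illustrative in-memory sample; the real file would be opened with open().
sample_csv = io.StringIO("city,lat,lng\nDetroit,42.33,-83.05\n")
coords = load_city_coords(sample_csv)
print(locate("Detroit", coords))
```

Because the same city name exists in several states, a more careful version would key on city plus state whenever the headquarters label also provides the state.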
Wikidata looks interesting. Is it possible to get the headquarters? The headquarters should be a data tag.
What would the query look like?
Is it possible to run multiple queries through the API?
The query will look like: SPARQL query
Yes, you can.
Maybe we could give ChatGPT all company names as a list -> it could transform the names into the official names.
But I also think there is a small problem with the tags: the tag is not found for "Altria Group": https://www.wikidata.org/wiki/Q445007
Maybe DBpedia is better for us, because DBpedia has a full-text search and Wikidata doesn't. https://dbpedia.org/sparql
Wikidata and DBpedia are both projects that aim to make information from Wikipedia more accessible and usable by computers, but they go about it in different ways:
Data Source:
DBpedia: Extracts data from the infoboxes (information templates) on Wikipedia articles. This process is automated.
Wikidata: Creates a separate knowledge base where users can manually enter information about entities (people, places, things).
Focus:
DBpedia: Focuses on generating Linked Open Data (LOD) directly from Wikipedia.
Wikidata: Focuses on creating a central repository of structured data that can be used by Wikipedia and other projects.
If we scrape this dataset (https://money.cnn.com/magazines/fortune/fortune500_archive/full/1955/):
How do we connect these companies to their headquarters locations? -> find a dataset where all headquarters are listed? -> generate the Wikipedia link to extract the location? (Wikipedia has the geolocation data for each headquarters)