Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Create Extraction Metadata when scraped data is submitted #154

Closed josh-chamberlain closed 1 year ago

josh-chamberlain commented 2 years ago

Related to https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/issues/80, #173

Tasks

General purpose

This is a Python module, named something like extraction_metadata.py, in /common. It generates metadata on the fly, using the DoltHub API to get the most up-to-date information about the scraper at the time it runs.

Pinging the DoltHub API

Because scrapers and datasets change constantly, this should be done on the fly.

This Python 3 snippet gets all the agencies. We should still write a more useful query that only needs the dataset ID substituted in.

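import requests

# Select every row of the `agencies` table (URL-encoded SQL in the `q` parameter)
url = "https://www.dolthub.com/api/v1alpha1/pdap/datasets/master?q=SELECT%20*%20FROM%20%60agencies%60"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on an HTTP error
data = response.json()
print(data)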
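
The dataset-specific query mentioned above might look something like the following. This is only a minimal sketch: the `datasets` table name and its `id` column are assumptions (check the actual schema of pdap/datasets on DoltHub), and `build_dataset_query` and `dataset_query_url` are hypothetical helper names. The ID is checked against a 32-character hex pattern, matching the sample IDs in this issue, before it is substituted into the SQL.

```python
import re
from urllib.parse import quote

# Base endpoint taken from the snippet above.
DOLTHUB_API = "https://www.dolthub.com/api/v1alpha1/pdap/datasets/master"

# Assumption: IDs are 32 lowercase hex characters, like the samples in this issue.
_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def build_dataset_query(dataset_id):
    """Return SQL selecting one dataset row (table and column names are assumed)."""
    if not _ID_PATTERN.fullmatch(dataset_id):
        raise ValueError(f"unexpected dataset ID: {dataset_id!r}")
    return f"SELECT * FROM `datasets` WHERE `id` = '{dataset_id}'"


def dataset_query_url(dataset_id):
    """Return the full request URL with the query percent-encoded."""
    return f"{DOLTHUB_API}?q={quote(build_dataset_query(dataset_id))}"
```

The resulting URL could then be passed to requests.get() exactly as in the snippet above.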

Sample metadata

{
    "agency":{
        "agency_id": "73e93439e6bf4ffc8b3f931a86fa3ad0",
        "agency_name":"Clanton Police Department",
        "agency_coords":{"lat": "32.83853", "lng":"-86.62936"},
        "agency_type" : 4,
        "city":"Clanton",
        "state": "AL",
        "zip":"35045",
        "county_fips":"01021"
    },
    "dataset":{
        "dataset_id": "5740697099a311ebab258c8590d4a7fc",
        "url":"https://cityprotect.com/agency/540048e6ee664a6f88ae0ceb93717e50",
        "full_data_location":"data/cityprotect",
        "source_type": 3,
        "data_type": 10,
        "format_type": 2
    },
    "extraction":{
        "extraction_start":DATETIME,
        "extraction_finish":DATETIME,
        "dataset_archive":URL
    }
}
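
A rough sketch of how the module could assemble this structure once the agency and dataset rows have been fetched. `build_extraction_metadata` is a hypothetical name, the ISO 8601 timestamp format is an assumption (the sample only says DATETIME), and the archive URL is a placeholder.

```python
from datetime import datetime, timezone


def build_extraction_metadata(agency, dataset, start, finish, archive_url):
    """Combine agency/dataset rows with timing info into the sample layout."""
    return {
        "agency": dict(agency),
        "dataset": dict(dataset),
        "extraction": {
            # ISO 8601 strings are an assumption; the issue only says DATETIME.
            "extraction_start": start.isoformat(),
            "extraction_finish": finish.isoformat(),
            "dataset_archive": archive_url,
        },
    }


start = datetime.now(timezone.utc)
# ... the scraper would run here ...
finish = datetime.now(timezone.utc)
metadata = build_extraction_metadata(
    {"agency_id": "73e93439e6bf4ffc8b3f931a86fa3ad0", "state": "AL"},
    {"dataset_id": "5740697099a311ebab258c8590d4a7fc", "format_type": 2},
    start,
    finish,
    "https://example.com/archive.zip",  # placeholder URL
)
```

Recording the start and finish times around the scraper run, rather than inside the helper, keeps the helper a pure function that is easy to test.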

CaptainStabs commented 2 years ago

Code I used to create queries based on a user's input for the GUI:

import jmespath
import requests

# Get user input from text box
homepage_url = self.homepageURLSearch_input.text()
owner, repo, branch = 'pdap', 'datasets', 'master'

# Create query (note: the user's input is interpolated directly into the SQL)
query = f'''SELECT * FROM `agencies` WHERE `homepage_url` LIKE "%{homepage_url}%"'''

# Send query to DoltHub
res = requests.get('https://www.dolthub.com/api/v1alpha1/{}/{}/{}'.format(owner, repo, branch), params={'q': query})

# Get response as JSON
jsoned = res.json()

# Keep only the "rows" list from the response
expression = jmespath.compile("rows[]")
self.searched = expression.search(jsoned)

https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/main/setup_gui/ScraperSetup.py#L109
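
Since the query above is built from raw user input, a stricter variant might validate the input first. The following is only a sketch, not the code in ScraperSetup.py: the allowed character set is an assumption about what a homepage URL fragment needs, and `build_homepage_query` and `extract_rows` are hypothetical helper names. `extract_rows` uses plain dict access in place of jmespath and relies only on the "rows" key that the jmespath expression above already assumes.

```python
import re

# Assumption: a homepage URL fragment only needs these characters.
_SAFE_FRAGMENT = re.compile(r"[A-Za-z0-9.:/_~-]+")


def build_homepage_query(homepage_url):
    """Return the LIKE query, rejecting input that could break out of the string."""
    if not _SAFE_FRAGMENT.fullmatch(homepage_url):
        raise ValueError(f"unsupported characters in {homepage_url!r}")
    return f'SELECT * FROM `agencies` WHERE `homepage_url` LIKE "%{homepage_url}%"'


def extract_rows(payload):
    """Return the "rows" list from a decoded response, or [] if it is absent."""
    return payload.get("rows") or []
```

Rejecting unexpected characters is a blunter tool than escaping, but it avoids having to get the SQL escaping rules exactly right.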