Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Create Archive snapshot of dataset url when Scrapers are run #180

Closed · josh-chamberlain closed this issue 2 years ago

josh-chamberlain commented 2 years ago

This should ping the Internet Archive with a request to archive the site at the time the scraper is run.

from archive-it:

While we would love to have y’all as an Archive-It partner, I think this specific request may be better suited for our Wayback Machine’s "Save Page Now" (SPN) functionality. I’ve found a few resources on SPN API integrations that might fit your needs. Here is the standard API info page: https://archive.org/help/wayback_api.php. Here is our developer wiki: https://archive.readme.io/docs/overview. I also found this resource for a python wrapper for SPN: https://github.com/palewire/savepagenow.

Please let me know if you find something here that works for you so I can share with the team and anyone else who may have a similar request in the future! If not, I can reach out to some of my colleagues in our patron services division to see if they have other suggestions, or simply connect you to someone.
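
For reference, a minimal sketch of triggering SPN through the palewire/savepagenow wrapper mentioned above (the target URL is just the example from the schema below; retry and rate-limit handling are omitted):

import savepagenow

# Ask the Wayback Machine's "Save Page Now" to snapshot the dataset URL.
# capture_or_cache() returns the Wayback URL plus a flag saying whether a fresh
# capture was made or a recent existing snapshot was reused.
archive_url, fresh_capture = savepagenow.capture_or_cache(
    "https://acworthpolice.org/departments/annual-report"
)
print(archive_url)  # a https://web.archive.org/web/... URL to store in the schema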

CaptainStabs commented 2 years ago

The main issue that I can foresee is that the scraper will have to know its ID before it's run in order to pass it to the archival script. The archival script needs to know either the dataset_id or scraper_id so that it can access the proper dataset within the schema and add the archived URL there. Remember that the schema is designed for there to be multiple scrapers within it.

Instead of the example that I made in discord here, the archives will need to be linked back to the dataset like this:

I'm also not sure about calling it extractions. I think archives or runs would be better.

(The agency portion of the data has been removed for clarity)

{
    "agency_id": "",
    "agency_info":{"removed"},
    "data": [
        {
            "dataset_id": "0f37949c84e54dec8a9ccb3e94038308",
            "url": "https://acworthpolice.org/departments/annual-report",
            "full_data_location": "./data/",
            "source_type": 2,
            "data_type": 24,
            "format_type": 1,
            "update_freq": 1,
            "last_modified": "2022-05-25 18:40:50 UTC",
            "scraper_path": "./test/state/county/municipal/city/_scraper.py",
            "scraper_id": "c3c8a8de0a93407aa1aaa6a4b6e4c8a2",
            "mapping": {}
        }
    ],
    "archives":[
        {
            "dataset_id": "0f37949c84e54dec8a9ccb3e94038308",
            "run_start": "2020-05-19 12:21:04 GMT",
            "run_end": "2020-05-19 12:21:04 GMT",
            "website_archive": "wayback url"
        }
    ]
}
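
As an illustration only (the helper name and signature are hypothetical, not from the branch), appending one of these archive records after a run could look roughly like this, using the same keys as the "archives" entry above:

import json

def add_archive_record(schema_path, dataset_id, wayback_url, run_start, run_end):
    # Hypothetical sketch: link the archive record back to its dataset by dataset_id
    with open(schema_path, "r+", encoding="utf-8") as f:
        schema = json.load(f)
        schema.setdefault("archives", []).append({
            "dataset_id": dataset_id,
            "run_start": run_start,
            "run_end": run_end,
            "website_archive": wayback_url,
        })
        f.seek(0)
        json.dump(schema, f, indent=4)
        f.truncate()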
josh-chamberlain commented 2 years ago

If the scraper needs to know its ID before it can be run, it sounds like the process goes something like this:

  1. someone writes a scraper.
  2. part of the scraper.py file includes a reference to the scraper's UUID, which is required for submission.
  3. part of the scraper process includes a requirement to add a scraper ID to the scrapers table in DoltHub, linking the scraper and dataset by ID in our relational database.
  4. scraper submissions can only be accepted to our intake API if the scraper ID exists in the DoltHub database (this would be validated as part of running the scraper).

So a scraper can be run without this UUID, but the results will not be submitted to the API. I could download and run a scraper for personal use in this scenario.
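
A rough sketch of the check in step 4, assuming the intake side can load the set of registered scraper IDs from the scrapers table in DoltHub (the function and variable names here are hypothetical):

def accept_submission(scraper_id, registered_scraper_ids):
    # registered_scraper_ids would come from the scrapers table in DoltHub
    if scraper_id not in registered_scraper_ids:
        raise ValueError(f"Unknown scraper_id {scraper_id}; submission rejected")
    return True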

~as far as terminology, that's fine—archives is sensible. I will need to do some work to change "Extractions" to "Archive" everywhere, which I can take care of in the next couple days.~

I think dataset_snapshot makes more sense than website_archive. We have a particular definition for "dataset," so this terminology is aligned.

CaptainStabs commented 2 years ago

Actually, I take back what I said. I should be able to match the dataset in the schema based on the URL the scraper provides to the archival script.

Okay this works:

import json
import os

def create_metadata(url):
    if not os.path.exists("schema.json"):
        print("Please create a schema.json file in the same directory as this script and fill it out with the schema generator.")
        return None

    with open("schema.json", "r+", encoding="utf-8") as schema_out:
        data = json.load(schema_out)
        agency_data = data["data"]
        # Find the dictionary in the "data" list whose URL matches the one the scraper provided
        dataset = next((item for item in agency_data if item["url"] == url), None)
        if dataset is None:
            print(f"No dataset in schema.json matches {url}")
            return None
        return dataset["dataset_id"]
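
A minimal usage sketch of the lookup above, assuming the URL from the schema example earlier in the thread:

dataset_id = create_metadata("https://acworthpolice.org/departments/annual-report")
if dataset_id:
    print(f"Archive for this run will be linked to dataset {dataset_id}")
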
CaptainStabs commented 2 years ago

> as far as terminology, that's fine—archives is sensible. I will need to do some work to change "Extractions" to "Archive" everywhere, which I can take care of in the next couple days.

Extractions would be the raw data, not the archive of the source of the dataset.

> I think dataset_snapshot makes more sense than website_archive. We have a particular definition for "dataset," so this terminology is aligned.

👍

CaptainStabs commented 2 years ago

The latest branch has the code for this.

josh-chamberlain commented 2 years ago

@CaptainStabs oh, I see. Yes, I agree: archive represents what we create with the Internet Archive, and extractions are the raw data generated by the scrapers.

josh-chamberlain commented 2 years ago

@CaptainStabs Now that I have seen the branch, I think dataset_archive makes the most contextual sense.