Closed by josh-chamberlain 2 years ago
The main issue that I can foresee is that the scraper will have to know its ID before it's run in order to pass it to the archival script. The archival script needs to know either the dataset_id or the scraper_id so that it can access the proper dataset within the schema and add the archived URL there. Remember that the schema is designed for there to be multiple scrapers within it.
Instead of the example that I made in discord here, the archives will need to be linked back to the dataset like this:
I'm also not sure about calling it extractions. I think archives or runs would be better.
(The agency portion of the data has been removed for clarity)
{
    "agency_id": "",
    "agency_info": "removed",
    "data": [
        {
            "dataset_id": "0f37949c84e54dec8a9ccb3e94038308",
            "url": "https://acworthpolice.org/departments/annual-report",
            "full_data_location": "./data/",
            "source_type": 2,
            "data_type": 24,
            "format_type": 1,
            "update_freq": 1,
            "last_modified": "2022-05-25 18:40:50 UTC",
            "scraper_path": "./test/state/county/municipal/city/_scraper.py",
            "scraper_id": "c3c8a8de0a93407aa1aaa6a4b6e4c8a2",
            "mapping": {}
        }
    ],
    "archives": [
        {
            "dataset_id": "0f37949c84e54dec8a9ccb3e94038308",
            "run_start": "2020-05-19 12:21:04 GMT",
            "run_end": "2020-05-19 12:21:04 GMT",
            "website_archive": "wayback url"
        }
    ]
}
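To make that linkage concrete, here is a minimal sketch of how an archival script that already knows the dataset_id could append an entry to the archives list above. This is not code from the repo; add_archive_entry and its parameters are hypothetical names used only for illustration.

```python
import json


def add_archive_entry(schema_path, dataset_id, wayback_url, run_start, run_end):
    """Append an archive record that is tied back to its dataset by dataset_id."""
    with open(schema_path, "r+", encoding="utf-8") as schema_file:
        schema = json.load(schema_file)
        schema.setdefault("archives", []).append({
            "dataset_id": dataset_id,        # links the archive back to the dataset above
            "run_start": run_start,
            "run_end": run_end,
            "website_archive": wayback_url,
        })
        # Rewrite the file in place with the new entry included
        schema_file.seek(0)
        json.dump(schema, schema_file, indent=4)
        schema_file.truncate()
```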
If the scraper needs to know its ID before it can be run, it sounds like the process goes something like this: the scraper.py file includes a reference to the scraper's UUID, which is required for submission. So a scraper can be run without this UUID, but the results will not be submitted to the API. I could download and run a scraper for personal use in this scenario.
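A rough sketch of that split, under the assumption that the UUID lives as a module-level constant in scraper.py; scrape, submit_results, and run are hypothetical placeholders, not names from the actual scraper template:

```python
# Hypothetical sketch: the scraper carries its own UUID, and submission is skipped without it.
SCRAPER_ID = "c3c8a8de0a93407aa1aaa6a4b6e4c8a2"  # UUID assigned when the scraper is registered in the schema


def scrape():
    # Placeholder for the scraper's normal data collection.
    return {"rows": []}


def submit_results(scraper_id, results):
    # Placeholder for the call that sends results to the API.
    print(f"Submitting {len(results['rows'])} rows for scraper {scraper_id}")


def run():
    results = scrape()
    if SCRAPER_ID:
        submit_results(SCRAPER_ID, results)
    else:
        # Without a UUID the scraper still runs (e.g. for personal use),
        # but nothing is submitted to the API.
        print("No scraper ID set; skipping submission.")
    return results
```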
~as far as terminology, that's fine; archives is sensible. I will need to do some work to change "Extractions" to "Archive" everywhere, which I can take care of in the next couple days.~
I think dataset_snapshot makes more sense than website_archive. We have a particular definition for "dataset", so this terminology is aligned.
Actually, I take back what I said. I should be able to match the scraper in the schema based on the URL it provides to the archival script.
Okay this works:
import json
import os


def create_metadata(url):
    if not os.path.exists("schema.json"):
        print("Please create a schema.json file in the same directory as this script and fill it out with the schema generator.")
        return
    with open("schema.json", "r+", encoding="utf-8") as schema_out:
        data = json.load(schema_out)
        agency_data = data["data"]
        # Find the entry in the "data" list whose URL matches the one this scraper provided
        dataset = next((item for item in agency_data if item["url"] == url), None)
        if dataset is None:
            print(f"No dataset in schema.json matches {url}")
            return
        dataset_id = dataset["dataset_id"]
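Assuming the function above is eventually extended to return the matched dataset_id (the snippet presumably continues past what's pasted), the archival script could call it with the scraper's URL; a purely illustrative example using the values from the schema above:

```python
# Hypothetical usage, assuming create_metadata returns the matched dataset_id
dataset_id = create_metadata("https://acworthpolice.org/departments/annual-report")
# -> "0f37949c84e54dec8a9ccb3e94038308" per the schema example above
```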
> as far as terminology, that's fine; archives is sensible. I will need to do some work to change "Extractions" to "Archive" everywhere, which I can take care of in the next couple days.

Extractions would be the raw data, not the archive of the source of the dataset.

> I think dataset_snapshot makes more sense than website_archive. We have a particular definition for "dataset", so this terminology is aligned.

👍
The latest branch has the code for this.
@CaptainStabs oh, I see. Yes, I agree: archive represents what we create with the Internet Archive, and extractions are the raw data generated by the scrapers.
@CaptainStabs Now that I have seen the branch, I think dataset_archive makes the most contextual sense.
This should ping the Internet Archive with a request to archive the site at the time the scraper runs.
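For illustration, a minimal sketch of that ping using the Wayback Machine's public "Save Page Now" endpoint via requests. This reflects general web.archive.org behavior, not necessarily how the branch implements it, and archive_url is a hypothetical name:

```python
import requests


def archive_url(url):
    """Ask the Wayback Machine to capture `url` and return the snapshot URL, if one is reported."""
    response = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    response.raise_for_status()
    # Captures are usually reported in the Content-Location header as a /web/<timestamp>/<url> path;
    # fall back to the final response URL if the header is absent.
    snapshot_path = response.headers.get("Content-Location")
    if snapshot_path:
        return f"https://web.archive.org{snapshot_path}"
    return response.url


# e.g. wayback_url = archive_url("https://acworthpolice.org/departments/annual-report")
```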