Properties of Data Sources to identify

Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.

MIT License

5 stars 6 forks source link

Properties of Data Sources to identify #11

Open josh-chamberlain opened 1 year ago

josh-chamberlain commented 1 year ago

Context

We want to add metadata to URLs, filter for relevancy, and expand our database of valid data sources.

Flowchart

The overall plan for data source identification is now in the readme of this repo.

Properties

These are all explained in the data dictionary

S tier

[x] #9
[x] #12
[ ] #43

A tier

[ ] aggregation_type
[ ] access_type
[ ] record_download_option_provided
[ ] record_format
[ ] Is it agency_supplied and agency_originated? If not, who are the supplier and originator?
[ ] coverage_start
[ ] coverage_end
[ ] portal_type
[ ] scraper_url
[ ] readme_url

Still A tier, but rarely published:

[ ] retention_schedule
[ ] update_frequency
[ ] source_last_updated

B tier

[ ] size
[ ] update_method
[ ] sort_method
[ ] access_restrictions

nfmcclure commented 1 year ago

An issue with the common-crawl, is that it doesn't get all the PD org sites. The common crawl is very respectful of robots.txt and if a servers response time is lagging, it'll stop hitting that server all together. This can be seen because:

The common crawl export of .gov sites with 'pd' of 'police' in it has 9 separate sites from the host "https://police.birminghamal.gov". But if we look at the sitemap: https://police.birminghamal.gov/sitemap.xml, we can see much much more there.

I think the above is a separate issue because the solution requires scraping. I would suggest the solution here would be to find all the unique hosts and look at the (host-url/sitemap.xml) and extracting all the URLs listed there. (if sitemap exists).

I think the real issue is how to clean and categorize any URL. (then this solution can be applied to future sitemaps).

First and foremost, identifying relevant URLs in a large list is important. Here are some suggestions for next steps:

Can we find rules to remove bad URLs? E.g. 'blog', or 'store', or URLs with question marks in them, or non-secure http urls.
From the remaining, we would have to label URLs as relevant or not. Then we can use a very similar model(s) as the prior linked palewire/storysniffer. It looks like they started out labelling about 1-3 thousand URLs.

Second, I think from these relevant URLs, identifying features from them is important.

Agency Location. Identifying the department location (State police: CA, AZ, ..., metro police: NYC, LA, ..., county police: King County, ...)
- I think this should be attempted first in a dumb/simple way. Get a list of all metros, counties-states, and just look at which location string matches the host/domain url or URL + homepage HTML.
Record Type. This is hard. Probably something similar to the prior URL classification from storysniffer.
Coverage & Dates: Maybe something with the common crawl last access date? e.g. the most recent is 2022-40 (40th week of 2022 = Oct 3rd, 2022)? Also, note that the XML sitemaps tend to have a field "last modified", if the URL is in there. But again, looking at the XML sitemaps is more of a scraping task. Where-as the common crawl has already scraped any URL it contains.
Agency-supplied: Most of the URLs searched have the top-level-domains of gov or us. This usually means they are agency-supplied. We can search more domain names if we want.
Size. More thought needed here. Is the full HTML size an upper bound? Assuming data on URL.
others.... TBD.

josh-chamberlain commented 1 year ago

Thanks for thinking this through @nfmcclure ! I think gathering info from the URL will work in some cases—but I'm sure we'll miss a lot, and a lot of URLs are simply unhelpful. At some point we'll need to start looking at page headers and content to identify sources.

Agency location—since we have a homepage URL for most of the agencies in our database, we should be able to get a lot of URLs matched to agencies by simply comparing root domains. We could also probably use those root domains as a way to search for new URLs. I think the process would be something like:
- take an agency homepage URL
- use a combination of commoncrawl, probably internet archive, sitemap generator or find the sitemap to locate URLs on the domain, as well as other domains being used to display data (lots of data portals aren't on the government domain)
- run the URLs through a little toolkit of identification processes and scripts to get as many properties established as possible
Record type is definitely hard to get from the URL, though in some cases I'm sure it'd work. I think using what's on the page will be the best bet.
Good idea!
Agreed.
This depends—if there's a "download" button on the page, the size on disk of the file behind that button is the size we want. That said, "size" is a nice-to-know about a data source but not required. I'll update the hierarchy of these properties here in a minute.

Any strategy we develop will hit a point of diminishing returns where it's easier to just manually look at what's left, which is A-OK.

nfmcclure commented 1 year ago

I filtered about 1,000 unique host domains from states, counties, and cities.

examples:

https://troymi.gov/
https://trumanmn.us/
https://turlock.ca.us/

I'm guessing there's about 2,000 total unique host domains in the above CSV.

I built a scrapy Sitemap spider that looks through each robots.txt or sitemap.xml and finds all the routes on the server with the words "police" or "cop" in it. It stores the URL + last-modified date (if exists).

From 1000 host domains, it gets about 40k url paths.