Goals:
Build and maintain the most complete, accurate inventory of federal, public websites.
Scan those websites daily in order to generate a variety of useful data for known stakeholders.
Principles:
We work in the open.
We only consume information that is available to anyone over the public internet.
The program's products are machine-readable data files that are made publicly available, either as static flat files or queryable APIs (see the sketch after this list).
We do not make presentation layers.
We only design and build scans in response to user needs articulated by known stakeholders.
All scans run against the complete Federal Website Index.
If a scan is no longer needed or used by a known stakeholder, we deprecate it.
We follow the stakeholder experience (link).
We prioritize reliability and accuracy of scans that we have launched.
Our focus is on current data. Though scan data is snapshotted to an archive repo once a month, our system is ruthlessly focused on providing the best current data, not on serving as a warehouse for historical data.
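To make the "static flat files or queryable APIs" principle concrete, here is a small consumer-side sketch. The endpoint, query parameter, and API key shown are placeholders rather than the program's documented interface; they only illustrate what interacting with a queryable API product could look like.

```python
# Minimal sketch of querying scan data over an API.
# The endpoint, parameter names, and credential below are placeholders,
# not the program's documented interface.
import json
import urllib.parse
import urllib.request

API_BASE = "https://example.gov/site-scanning/v1/websites"  # hypothetical endpoint
API_KEY = "DEMO_KEY"  # hypothetical credential

def fetch_scan_record(target_url: str) -> dict:
    """Request the latest scan record for a single website."""
    query = urllib.parse.urlencode({"target_url": target_url, "api_key": API_KEY})
    with urllib.request.urlopen(f"{API_BASE}?{query}", timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(fetch_scan_record("gsa.gov"))
```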
Model:
We take public datasets, use an open-sourced method to assemble and process them, and then produce the resulting Federal Website Index as a hosted flat file. Anyone can download and interact with that file at a consistent fixed location.
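For illustration, pulling the index from that fixed location might look like the sketch below. The URL is a placeholder for the published location of the index file, and pandas is just one convenient CSV reader; neither is prescribed by the program.

```python
# Minimal sketch of downloading the Federal Website Index flat file.
# INDEX_URL is a placeholder; substitute the file's actual published location.
import pandas as pd

INDEX_URL = "https://example.gov/path/to/federal-website-index.csv"  # hypothetical location

def load_index(url: str = INDEX_URL) -> pd.DataFrame:
    """Download the index flat file and return it as a table of target websites."""
    index = pd.read_csv(url)
    print(f"Loaded {len(index)} target websites")
    return index

if __name__ == "__main__":
    load_index()
```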
We then tell the Site Scanning engine to ingest that public index file once a day and use it as the list of target URLs. Each target URL is then loaded and scanned, and the resulting data is put into a database. The database is queryable via an API, and every week a snapshot of all the data is published as a bulk, downloadable flat file.
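The sketch below illustrates that ingest-scan-store flow in miniature. It is not the actual scanning engine: the index column name, the derived fields, and the SQLite store are stand-ins chosen only to keep the example self-contained.

```python
# Illustrative sketch (not the actual engine) of the daily ingest -> scan -> store flow.
# The 'target_url' column name, derived fields, and SQLite store are assumptions.
import csv
import sqlite3
import urllib.request

def scan_target(url: str) -> dict:
    """Load one target URL and derive a few example data points."""
    try:
        # Assumes the index lists bare hostnames, so a scheme is prepended here.
        with urllib.request.urlopen(f"https://{url}", timeout=10) as resp:
            return {"target_url": url, "status_code": resp.status, "final_url": resp.geturl()}
    except Exception:
        return {"target_url": url, "status_code": None, "final_url": None}

def run_daily_scan(index_path: str, db_path: str = "scans.db") -> None:
    """Ingest the index file, scan each target, and write results to a local database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans (target_url TEXT, status_code INTEGER, final_url TEXT)"
    )
    with open(index_path, newline="") as f:
        for row in csv.DictReader(f):
            result = scan_target(row["target_url"])
            conn.execute(
                "INSERT INTO scans VALUES (:target_url, :status_code, :final_url)", result
            )
    conn.commit()
    conn.close()
```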
The methodologies of the scans are public so that anyone can see how the data for any given website was derived. We work to continually iterate and improve the index and the scan methodologies, while ensuring the reliability of the daily scans.