:toc: auto
This repository (bills
) defines a module (github.com/aih/bills
) with a number of packages defined in the /cmd
directory.
To build and create commands in the bills/cmd
directory, run make build
(to apply the Makefile
).
To test, run make test
(the tests are not complete).
To run both, simply make
.
The packages in the cmd
directory, which build to cmd/bin
are:
badgerkv:: a test for storing data in the badger
database. (TODO: convert this instead to a test for the badgerkv
package.)
billmeta:: command-line tool to create bill metadata and store it to a file. Command-line options include -p
to specify a parent path for the bills to process, or -billNumber
to process a specific bill. The metadata is created by makeBillsMeta and enriched by finding bills that have the same titles and main titles.
To run a sample and store results in testMeta.json
, run cmd/bin/billmeta -parentPath ./samples -billMetaPath ./samples/test/results/testMeta.json
committees:: command-line tool to download committees.yaml to tmp/committees.yaml
comparematrix:: a command-line tool to which takes a list of bills as input and outputs a matrix of bill similarity (including the category of similarity)
esquery:: find the similar bills for each section of bills. It depends on having an Elasticsearch index of bills, divided into sections. The esquery command can be run on a sample of bills, or all bills. Bills are not yet processed concurrently, but the architecture (processing one bill at a time, by bill number) is designed to allow this.
jsonpgx:: a command-line tool to work with posgtresql
legislators:: a command-line tool to download legislators.yaml to tmp/legislators.yaml
unitedstates:: a stub (not currently working) that will download and process bill data and metadata
Note: Some of these commands process many files in parallel. In order to prevent problems on systems that limit open files (e.g. Ubuntu), we've added a max open files parameter (see, e.g. billmeta
). In addition, to prevent crashes due to system memory limitations, on the production server, I increased file swap size to 4Gb (see https://askubuntu.com/a/1075516/686037).
Bills are downloaded into the file structure defined in https://github.com/unitedstates/congress
. Additional JSON metadata files are created and stored in the filepath for each bill, e.g. [path]/congress/data/117/bills/hr/hr1500
. The files are stored separately so that each process can be run independently and concurrently.
congress
directory (using unitedstates
repository at https://github.com/unitedstates/congress)
TODO: develop a Go alternative for downloads.billmeta
) and store in the path for each bill, . There is also an option to store all metadata in a file [path]/congress/billMetaGo.json
and in Golang key/value stores. This processing also creates a key/value store for titles and for main titles. These are stored in files (titleNoYearIndexGo.json and mainTitleNoYearIndexGo.json).
TODO: add an option to save these indexes to a databasebilltoxml.go
file contains utilities to parse XML and select sections. esquery
command. The list of similar sections for each bill is stored in the filename defined as esSimilarityFileName = "esSimilarity.json"
in cmd/esquery/main.go
. The bills that are similar to the latest version of a given bill are collected in another file, defined as esSimilarBillsDictFileName = "esSimilarBillsDict.json"
.identical
, nearly identical
, includes
, includedby
). A map of bill:categories is stored (also as part of esquery
) in a file defined by esSimilarCategoryFileName = "esSimilarCategory.json"