aih / bills

A processor for bills in Go
MIT License
0 stars 0 forks source link

:toc: auto

Process Bills with Golang

This repository (bills) defines a module (github.com/aih/bills) with a number of packages defined in the /cmd directory.

To build and create commands in the bills/cmd directory, run make build (to apply the Makefile).

To test, run make test (the tests are not complete).

To run both, simply make.

The packages in the cmd directory, which build to cmd/bin are:

badgerkv:: a test for storing data in the badger database. (TODO: convert this instead to a test for the badgerkv package.) billmeta:: command-line tool to create bill metadata and store it to a file. Command-line options include -p to specify a parent path for the bills to process, or -billNumber to process a specific bill. The metadata is created by makeBillsMeta and enriched by finding bills that have the same titles and main titles. To run a sample and store results in testMeta.json, run cmd/bin/billmeta -parentPath ./samples -billMetaPath ./samples/test/results/testMeta.json committees:: command-line tool to download committees.yaml to tmp/committees.yaml comparematrix:: a command-line tool to which takes a list of bills as input and outputs a matrix of bill similarity (including the category of similarity) esquery:: find the similar bills for each section of bills. It depends on having an Elasticsearch index of bills, divided into sections. The esquery command can be run on a sample of bills, or all bills. Bills are not yet processed concurrently, but the architecture (processing one bill at a time, by bill number) is designed to allow this. jsonpgx:: a command-line tool to work with posgtresql legislators:: a command-line tool to download legislators.yaml to tmp/legislators.yaml unitedstates:: a stub (not currently working) that will download and process bill data and metadata

Note: Some of these commands process many files in parallel. In order to prevent problems on systems that limit open files (e.g. Ubuntu), we've added a max open files parameter (see, e.g. billmeta). In addition, to prevent crashes due to system memory limitations, on the production server, I increased file swap size to 4Gb (see https://askubuntu.com/a/1075516/686037).

Bill Processing Pipeline

Bills are downloaded into the file structure defined in https://github.com/unitedstates/congress. Additional JSON metadata files are created and stored in the filepath for each bill, e.g. [path]/congress/data/117/bills/hr/hr1500. The files are stored separately so that each process can be run independently and concurrently.

  1. Download documents to congress directory (using unitedstates repository at https://github.com/unitedstates/congress) TODO: develop a Go alternative for downloads.
  2. Process bill metadata (using billmeta) and store in the path for each bill, . There is also an option to store all metadata in a file [path]/congress/billMetaGo.json and in Golang key/value stores. This processing also creates a key/value store for titles and for main titles. These are stored in files (titleNoYearIndexGo.json and mainTitleNoYearIndexGo.json). TODO: add an option to save these indexes to a database
  3. Index bill xml to Elasticsearch. Currently, this is done in Python in https://github.com/aih/BillMap. The processing there is relatively fast (< 10 minutes to index all bills), and processing performance may be limited by calls to Elasticsearch, so a Go alternative may not result in much performance boost. Note that the billtoxml.go file contains utilities to parse XML and select sections.
  4. For each bill, find similar bills by section using the esquery command. The list of similar sections for each bill is stored in the filename defined as esSimilarityFileName = "esSimilarity.json" in cmd/esquery/main.go. The bills that are similar to the latest version of a given bill are collected in another file, defined as esSimilarBillsDictFileName = "esSimilarBillsDict.json".
  5. For the most similar bills, calculate similarity scores and assign categories (e.g. identical, nearly identical, includes, includedby). A map of bill:categories is stored (also as part of esquery) in a file defined by esSimilarCategoryFileName = "esSimilarCategory.json"