ccao-data / ptaxsim

R package for estimating Cook County property tax bills
https://ccao-data.github.io/ptaxsim/
GNU Affero General Public License v3.0

Collect additional property tax related data using LLMs #8

Open dfsnow opened 1 year ago

dfsnow commented 1 year ago

Goal

Collect additional property tax related data from documents, using LLMs for parsing.

Overview

There is a significant amount of useful taxing district data currently locked in non-machine-readable formats, including TIF ordinances, TIF redevelopment plans, municipal/district budgets, SSA information, and more. If this data can be extracted and parsed, it would be a huge boon to PTAXSIM and would likely be the first-ever collection of such data.

The problem is that this data is messy. There is no standard format for something like a TIF ordinance, so each document has a completely different format and language depending on the municipality. Further, nearly all data of this type comes as PDF scans of legislative text, usually without any OCR applied, spanning hundreds or thousands of pages. As such, parsing this data into useful SQL tables is a massive challenge.

Fortunately, new tech may be able to help with this task. Current LLMs have proven especially capable of extracting relevant information from a large document or corpus. We may be able to use such LLMs to convert PDF scans of taxing district data into useful SQL tables.
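As a very rough sketch of what that extraction might look like (the endpoint, model, and field names below are illustrative assumptions, not decisions), an R call to a hosted LLM could ask for a fixed set of fields back as JSON:

```r
library(httr2)

extract_tif_fields <- function(ocr_text) {
  prompt <- paste0(
    "From the TIF ordinance text below, return a JSON object with keys ",
    "tif_name, municipality, adoption_date, and expiration_date. ",
    "Use null for anything not stated.\n\n",
    ocr_text
  )
  resp <- request("https://api.openai.com/v1/chat/completions") |>
    req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
    req_body_json(list(
      model = "gpt-4",  # placeholder; any capable hosted model would do
      messages = list(list(role = "user", content = prompt))
    )) |>
    req_perform()

  # The reply body is JSON; the model's answer (itself JSON) sits in
  # choices[[1]]$message$content
  jsonlite::fromJSON(resp_body_json(resp)$choices[[1]]$message$content)
}
```

The appeal of this pattern is that the output is already structured, so loading it into SQL is trivial; the open question is whether models are reliable enough on hundred-page scans.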

:warning: This is an experimental/moonshot task. We don't know for sure that LLMs will work here or that there's enough structured information to be useful. However, if it does work, then it will produce the first digitized collection of such data and a nice proof-of-concept that we can potentially use elsewhere in the office.

Getting Started

The first thing we need to do is take inventory, first of the data, then of LLMs. I would make spreadsheets tracking each of the relevant data points.

Data

We need to take stock of what data actually exists that is:

I recommend we start with the following datasets:

TIF information
Taxing district budgets

LLMs

The landscape around LLMs is changing pretty much daily right now. For this project to work, we need to take a snapshot of existing LLMs and determine their capabilities and whether they fit our needs. You'll need to do some exploration in this space. We're specifically looking for LLMs that:

Tasks

Before proceeding to coding, the following tasks should be complete:

Outline

Once the above tasks are complete, it's time to get coding. Since this will likely be a lot of data in various states of processing, I recommend making a data flow diagram and using the specific inventory (from above) to help track things. The coding can be divided into two stages: processing and package updates.

Processing

Broadly, you'll need to come up with a data collection schema that divides things into raw, processed, and completed buckets (see the sketch below). We can create a new S3 bucket/dir you can use to store each stage. This is the stage that will actually use LLMs. We can scope it out further as we get closer to it.
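To make that convention concrete, here is a minimal sketch of the three stages, assuming a hypothetical bucket name ("ptaxsim-tif-docs") and the aws.s3 package; the real bucket and key scheme are TBD:

```r
library(aws.s3)

bucket <- "ptaxsim-tif-docs"  # hypothetical bucket name

# Stage 1: upload the untouched PDF scan as-is
put_object(
  file   = "downloads/chicago_tif_ordinance_2006.pdf",
  object = "raw/tif_ordinance/chicago_tif_ordinance_2006.pdf",
  bucket = bucket
)

# Stage 2: after OCR, store the machine-readable text alongside
put_object(
  file   = "ocr/chicago_tif_ordinance_2006.txt",
  object = "processed/tif_ordinance/chicago_tif_ordinance_2006.txt",
  bucket = bucket
)

# Stage 3: LLM-parsed output, ready to load into the database
put_object(
  file   = "parsed/chicago_tif_ordinance_2006.csv",
  object = "completed/tif_ordinance/chicago_tif_ordinance_2006.csv",
  bucket = bucket
)
```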

Package updates

Once parsing is complete, the collected data needs to be added to the actual PTAXSIM database. This will be much simpler than the processing stage:
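For reference, loading one parsed table might look something like the sketch below, assuming the parsed output lands as CSV and a local copy of the PTAXSIM SQLite database; the table name "tif_ordinance" is a placeholder, not a settled schema:

```r
library(DBI)

# Read one completed/parsed file from the processing stage
tif_parsed <- read.csv("completed/tif_ordinance/chicago_tif_ordinance_2006.csv")

# Append it to a (placeholder) table in the PTAXSIM SQLite database
conn <- dbConnect(RSQLite::SQLite(), "ptaxsim.db")
dbWriteTable(conn, "tif_ordinance", tif_parsed, append = TRUE)
dbDisconnect(conn)
```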

Additional Requirements

mbjackson-capp commented 1 year ago

I've attached a sheet on some applications that ingest documents or files and use LLM prompting to extract information from them: Document context extractor LLM options.xlsx

The options that we want will follow a template similar to the following:

As anticipated, this space is very new, and dozens of applications have been released within the past few months; many of them are difficult to vet for quality. Most have tiered pricing that increases with the number of documents uploaded and/or queries per month.

I recommend we look first at Quivr and/or Steamship, which both have transparent codebases, thorough documentation, and a publicly identified person whom we can contact to discuss the software.

If available options aren't a good fit for the scale of data we plan to use, we may be able to code our own solution using a similar template to these projects.

mbjackson-capp commented 1 year ago

As discussed previously, there is at least one LLM that has been trained specifically on legal text, called LEGAL-BERT. See Chalkidis et al. (2020), "LEGAL-BERT: The Muppets straight out of Law School", https://arxiv.org/abs/2010.02559.

The model is available on HuggingFace, and it may be possible to "swap it into" a preexisting LLM-based application to compare its performance against a more generalist model: https://huggingface.co/nlpaueb

Two more research papers on the tailoring of these models for legal text are:

dfsnow commented 1 year ago

@mbjackson-capp Great work! I agree that Quivr and Steamship are probably the right place to start, with LEGAL-BERT and roll-your-own as fallbacks.

I think next steps are getting a handle on what data exists county-wide, figuring out which data elements they share/we can extract, and starting to build a small corpus of documents.

mbjackson-capp commented 1 year ago

As we found this week, the technology in this field is very nascent. Before advancing, it may make sense to wait a few weeks or months until a dominant player emerges with an inexpensive, user-friendly application, and/or until more people have gained experience building these systems and could advise us or work with us to meet our needs.

Efforts to gather PDFs in an automated manner were largely unsuccessful. With a mixture of basic web scraping and more manual downloading, I did assemble the following:

These files amount to 2.11 GB and are stored in a location I have messaged you separately. The vast majority of them still need OCR, though mass OCR is feasible with Acrobat (see the sketch below for an open-source alternative). Filenames are largely, but not fully, standardized (I changed some filenames to make them more standard).
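If Acrobat becomes a bottleneck, an open-source route is to render pages to images and run Tesseract from R via the pdftools and tesseract packages; the paths below are illustrative:

```r
library(pdftools)
library(tesseract)

ocr_pdf <- function(pdf_path) {
  # Render each page to a 300 DPI PNG, then run Tesseract on the images
  pngs <- pdf_convert(pdf_path, format = "png", dpi = 300)
  text <- ocr(pngs, engine = tesseract("eng"))
  unlink(pngs)  # clean up the intermediate images
  paste(text, collapse = "\n")
}

writeLines(
  ocr_pdf("raw/tif_ordinance/chicago_tif_ordinance_2006.pdf"),
  "processed/tif_ordinance/chicago_tif_ordinance_2006.txt"
)
```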

We should continue to think about scoping which fields to extract from the TIF data, and about which forms are most likely to track that information over time.

More notes about data retrieval

It seems like there's no way to scrape all the redevelopment plans and ordinances from the map page directly, but they appear to live in a single directory that returns a 403 Forbidden error (https://www.chicago.gov/content/dam/city/depts/dcd/tif/plans/).

More annual reports for TIFs across the County could be scraped from the Illinois Comptroller's Local Government Warehouse with a systematic approach to searching municipalities' names and/or iterating through the numbered codes the Comptroller uses for those municipalities: https://illinoiscomptroller.gov/constituent-services/local-government/local-government-warehouse/searchform/?SearchType=TIFSearch. There are also annual reports for the City spanning 1997 to 2021, which could be scraped automatically with a tuned web scraper script (a rough sketch follows): https://www.chicago.gov/city/en/depts/dcd/supp_info/tif-district-annual-reports-2004-present.html. Each year's link leads to a large raft of PDFs, e.g. https://www.chicago.gov/city/en/depts/dcd/supp_info/district-annual-reports--2021-.html.
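For the City pages, the scraper could be as simple as the rvest sketch below; the PDF-link pattern and relative-URL handling are assumptions that would need checking against the live pages:

```r
library(rvest)

scrape_pdf_links <- function(page_url) {
  # Pull every <a href> on the page, then keep only links ending in .pdf
  page  <- read_html(page_url)
  links <- html_attr(html_elements(page, "a"), "href")
  links <- links[!is.na(links) & grepl("\\.pdf$", links, ignore.case = TRUE)]
  # Relative links need the chicago.gov host prepended
  ifelse(grepl("^http", links), links, paste0("https://www.chicago.gov", links))
}

pdfs <- scrape_pdf_links(
  "https://www.chicago.gov/city/en/depts/dcd/supp_info/district-annual-reports--2021-.html"
)
dir.create("raw/annual_reports", recursive = TRUE, showWarnings = FALSE)
for (u in pdfs) {
  download.file(u, file.path("raw/annual_reports", basename(u)), mode = "wb")
}
```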

dfsnow commented 1 year ago

Excellent, thanks for the update @mbjackson-capp. We'll restart this issue once the technology settles down a bit. In the meantime, I'll continue to put out feelers for more TIF and budget data and will manually add to what you've already collected.