ccao-data / ptaxsim

R package for estimating Cook County property tax bills
https://ccao-data.github.io/ptaxsim/
GNU Affero General Public License v3.0

Collect additional property tax related data using LLMs #8

Open dfsnow opened 1 year ago

dfsnow commented 1 year ago

Goal

Collect additional property tax related data from documents, using LLMs for parsing.

Overview

There is a significant amount of useful taxing district data currently locked in non-machine-readable formats, including TIF ordinances, TIF redevelopment plans, municipal/district budgets, SSA information, and more. If this data can be extracted and parsed, it would be a huge boon to PTAXSIM and would likely be the first-ever collection of such data.

The problem is that this data is messy. There is no standard format for something like a TIF ordinance, so each document has a completely different format and language depending on the municipality. Further, nearly all data of this type comes as PDF scans of legislative text, usually without any OCR applied, spanning hundreds or thousands of pages. As such, parsing this data into useful SQL tables is a massive challenge.

Fortunately, new tech may be able to help with this task. Current LLMs have proven especially capable of extracting relevant information from a large document or corpus. We may be able to use such LLMs to convert PDF scans of taxing district data into useful SQL tables.
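As a very rough sketch of what that extraction might look like (the endpoint, model, and field names below are illustrative assumptions, not decisions), an R call to a hosted LLM could ask for a fixed set of fields back as JSON:

```r
library(httr2)

extract_tif_fields <- function(ocr_text) {
  prompt <- paste0(
    "From the TIF ordinance text below, return a JSON object with keys ",
    "tif_name, municipality, adoption_date, and expiration_date. ",
    "Use null for anything not stated.\n\n",
    ocr_text
  )
  resp <- request("https://api.openai.com/v1/chat/completions") |>
    req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
    req_body_json(list(
      model = "gpt-4",  # placeholder; any capable hosted model would do
      messages = list(list(role = "user", content = prompt))
    )) |>
    req_perform()

  # The reply body is JSON; the model's answer (itself JSON) sits in
  # choices[[1]]$message$content
  jsonlite::fromJSON(resp_body_json(resp)$choices[[1]]$message$content)
}
```

The appeal of this pattern is that the output is already structured, so loading it into SQL is trivial; the open question is whether models are reliable enough on hundred-page scans.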

:warning: This is an experimental/moonshot task. We don't know for sure that LLMs will work here or that there's enough structured information to be useful. However, if it does work, then it will produce the first digitized collection of such data and a nice proof-of-concept that we can potentially use elsewhere in the office.

Getting Started

The first thing we need to do is take inventory, first of the data, then of LLMs. I would make spreadsheets tracking each of the relevant data points.

Data

We need to take stock of what data actually exists that is:

I recommend we start with the following datasets:

TIF information
Taxing district budgets

LLMs

The landscape around LLMs is changing pretty much daily right now. For this project to work, we need to take a snapshot of existing LLMs and determine their capabilities and whether they fit our needs. You'll need to do some exploration in this space. We're specifically looking for LLMs that:

Tasks

Before proceeding to coding, the following tasks should be complete:

Outline

Once the above tasks are complete, it's time to get coding. Since this will likely be a lot of data in various states of processing, I recommend making a data flow diagram and using the specific inventory (from above) to help track things. The coding can be divided into two stages: processing and package updates.

Processing

Broadly, you'll need to come up with a data collection schema that divides things into raw, processed, and completed buckets (see the sketch below). We can create a new S3 bucket/dir you can use to store each stage. This is the stage that will actually use LLMs. We can scope it out further as we get closer to it.
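To make that convention concrete, here is a minimal sketch of the three stages, assuming a hypothetical bucket name ("ptaxsim-tif-docs") and the aws.s3 package; the real bucket and key scheme are TBD:

```r
library(aws.s3)

bucket <- "ptaxsim-tif-docs"  # hypothetical bucket name

# Stage 1: upload the untouched PDF scan as-is
put_object(
  file   = "downloads/chicago_tif_ordinance_2006.pdf",
  object = "raw/tif_ordinance/chicago_tif_ordinance_2006.pdf",
  bucket = bucket
)

# Stage 2: after OCR, store the machine-readable text alongside
put_object(
  file   = "ocr/chicago_tif_ordinance_2006.txt",
  object = "processed/tif_ordinance/chicago_tif_ordinance_2006.txt",
  bucket = bucket
)

# Stage 3: LLM-parsed output, ready to load into the database
put_object(
  file   = "parsed/chicago_tif_ordinance_2006.csv",
  object = "completed/tif_ordinance/chicago_tif_ordinance_2006.csv",
  bucket = bucket
)
```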

Package updates

Once parsing is complete, the collected data needs to be added to the actual PTAXSIM database. This will be much simpler than the processing stage:
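For reference, loading one parsed table might look something like the sketch below, assuming the parsed output lands as CSV and a local copy of the PTAXSIM SQLite database; the table name "tif_ordinance" is a placeholder, not a settled schema:

```r
library(DBI)

# Read one completed/parsed file from the processing stage
tif_parsed <- read.csv("completed/tif_ordinance/chicago_tif_ordinance_2006.csv")

# Append it to a (placeholder) table in the PTAXSIM SQLite database
conn <- dbConnect(RSQLite::SQLite(), "ptaxsim.db")
dbWriteTable(conn, "tif_ordinance", tif_parsed, append = TRUE)
dbDisconnect(conn)
```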

Additional Requirements

mbjackson-capp commented 1 year ago

I've attached a sheet on some applications that ingest documents or files and use LLM prompting to extract information from them: Document context extractor LLM options.xlsx

The options that we want will follow a template similar to the following:

As anticipated, this space is very new, and dozens of applications have been released within the past few months; many of them are difficult to vet for quality. Most have tiered pricing that increases with the number of documents uploaded and/or queries per month.

I recommend we look first at Quivr and/or Steamship, which both have transparent codebases, thorough documentation, and a publicly identified person whom we can contact to discuss the software.

If available options aren't a good fit for the scale of data we plan to use, we may be able to code our own solution using a similar template to these projects.

mbjackson-capp commented 1 year ago

As discussed previously, there is at least one LLM that has been trained specifically on legal text, called LEGAL-BERT. See Chalkidis et al. (2020), "LEGAL-BERT: The Muppets straight out of Law School", https://arxiv.org/abs/2010.02559.

The model is available on HuggingFace, and it may be possible to "swap it into" a preexisting LLM-based application to compare its performance against a more generalist model: https://huggingface.co/nlpaueb

Two more research papers on the tailoring of these models for legal text are:

dfsnow commented 1 year ago

@mbjackson-capp Great work! I agree that Quivr and Steamship are probably the right place to start, with LEGAL-BERT and roll-your-own as fallbacks.

I think next steps are getting a handle on what data exists county-wide, figuring out which data elements they share/we can extract, and starting to build a small corpus of documents.

mbjackson-capp commented 1 year ago

As we found this week, the technology in this field is very nascent. Before advancing, it may make sense to wait a few weeks or months until a dominant player emerges with an inexpensive, user-friendly application, and/or until more people have gained experience building these systems and could advise us or work with us to meet our needs.

Efforts to gather PDFs in an automated manner were largely unsuccessful. With a mixture of basic web scraping and more manual downloading, I did assemble the following:

These files amount to 2.11 GB and are stored in a location I have messaged you separately. The vast majority of them still need OCR, though mass OCR is feasible with Acrobat (see the sketch below for an open-source alternative). Filenames are largely, but not fully, standardized (I changed some filenames to make them more standard).
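If Acrobat becomes a bottleneck, an open-source route is to render pages to images and run Tesseract from R via the pdftools and tesseract packages; the paths below are illustrative:

```r
library(pdftools)
library(tesseract)

ocr_pdf <- function(pdf_path) {
  # Render each page to a 300 DPI PNG, then run Tesseract on the images
  pngs <- pdf_convert(pdf_path, format = "png", dpi = 300)
  text <- ocr(pngs, engine = tesseract("eng"))
  unlink(pngs)  # clean up the intermediate images
  paste(text, collapse = "\n")
}

writeLines(
  ocr_pdf("raw/tif_ordinance/chicago_tif_ordinance_2006.pdf"),
  "processed/tif_ordinance/chicago_tif_ordinance_2006.txt"
)
```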

We should continue to think about scoping which fields to extract from the TIF data, and about which forms are most likely to track that information over time.

More notes about data retrieval

It seems like there's no way to scrape all the redevelopment plans and ordinances from the map page directly, but they appear to live in a single directory that returns a 403 Forbidden error (https://www.chicago.gov/content/dam/city/depts/dcd/tif/plans/).

More annual reports for TIFs across the County could be scraped from the Illinois Comptroller's Local Government Warehouse with a systematic approach to searching municipalities' names and/or iterating through the numbered codes the Comptroller uses for those municipalities: https://illinoiscomptroller.gov/constituent-services/local-government/local-government-warehouse/searchform/?SearchType=TIFSearch. There are also annual reports for the City spanning 1997 to 2021, which could be scraped automatically with a tuned web scraper script (a rough sketch follows): https://www.chicago.gov/city/en/depts/dcd/supp_info/tif-district-annual-reports-2004-present.html. Each year's link leads to a large raft of PDFs, e.g. https://www.chicago.gov/city/en/depts/dcd/supp_info/district-annual-reports--2021-.html.
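For the City pages, the scraper could be as simple as the rvest sketch below; the PDF-link pattern and relative-URL handling are assumptions that would need checking against the live pages:

```r
library(rvest)

scrape_pdf_links <- function(page_url) {
  # Pull every <a href> on the page, then keep only links ending in .pdf
  page  <- read_html(page_url)
  links <- html_attr(html_elements(page, "a"), "href")
  links <- links[!is.na(links) & grepl("\\.pdf$", links, ignore.case = TRUE)]
  # Relative links need the chicago.gov host prepended
  ifelse(grepl("^http", links), links, paste0("https://www.chicago.gov", links))
}

pdfs <- scrape_pdf_links(
  "https://www.chicago.gov/city/en/depts/dcd/supp_info/district-annual-reports--2021-.html"
)
dir.create("raw/annual_reports", recursive = TRUE, showWarnings = FALSE)
for (u in pdfs) {
  download.file(u, file.path("raw/annual_reports", basename(u)), mode = "wb")
}
```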

dfsnow commented 1 year ago

Excellent, thanks for the update @mbjackson-capp. We'll restart this issue once the technology settles down a bit. In the meantime, I'll continue to put out feelers for more TIF and budget data and will manually add to what you've already collected.