For the latest paper please see here.
For the latest data please see here.
Please note that the code in this repository reflects an earlier version of the project. An update with more recent code will arrive soon.
This is the GitHub repository for a project using Large Language Models (LLMs) to parse zoning documents. We introduce a new approach to decoding and interpreting statutes and administrative documents that employs LLMs for data collection and analysis, which we call generative regulatory measurement. We use this tool to construct a detailed assessment of U.S. zoning regulations, and we estimate the correlation of these housing regulations with housing costs and construction. Our work highlights the efficacy and reliability of LLMs in measuring and interpreting complex regulatory datasets.
Please note that all LLMs used in this project involve a degree of randomness and thus cannot be replicated exactly. We limit this randomness by, for example, setting each model's temperature to very low levels, so replication results should still be very similar.
Main dependencies: please see requirements.txt for a full list of the Python packages needed to replicate the project. Run the command below in the terminal to install them:

```
pip install -r requirements.txt
```
1) Clone the repository:

```
git clone https://github.com/dmilo75/ai-zoning.git
```

3) To create embeddings, configure config.yaml with:

- raw_data: path where the questions are stored

4) After setting the paths, run embedding.py:

```
python3 "path-to-repository/code/Main Model Code/embeddings.py"
```

This process creates embeddings from the raw text and stores them in the embeddings path provided in the config.yaml file.
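The chunking step that precedes embedding can be sketched as follows. This is a minimal illustration, not the actual code from embeddings.py; the chunk size, overlap, and word-based splitting are all assumptions:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split raw ordinance text into overlapping word-based chunks.

    chunk_size and overlap are measured in words; both values here are
    illustrative, not the parameters used in embeddings.py.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Each chunk would then be passed through an embedding model, and the
# resulting vectors stored under the embeddings path from config.yaml.
```

The overlap keeps a regulation that straddles a chunk boundary from being split in a way that loses its context.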
Steps 1 and 2 below are required only if you plan to use the Llama-2 model; otherwise, skip them.

1) To download Llama-2 13B/70B GPTQ, run the Python script (ai-zoning/code/download.py), which downloads the model from Hugging Face. Make sure to change the download directory when using the code.

2) Clone the exllama repository in a separate directory:

```
git clone https://github.com/turboderp/exllama.git
```

4) To perform inference, configure config.yaml with:

- num_neighbors: optional; determines the number of chunks of text to include in the context.

5) After setting the paths, run QA_Code_V3.py:

```
python3 "path-to-repository/code/Main Model Code/QA_Code_V3.py"
```

This process gathers inference results from the LLM and stores them in the processed_data path provided in the config.yaml file.
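The role of num_neighbors can be illustrated with a minimal cosine-similarity retrieval sketch. The function names and vectors here are hypothetical; the project's actual retrieval logic lives in the main model code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, chunks, k):
    """Return the k chunks whose embeddings are most similar to the query.

    In the pipeline, k would correspond to num_neighbors from config.yaml,
    and the selected chunks are concatenated into the LLM's context.
    """
    scored = sorted(zip(chunk_vecs, chunks),
                    key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

Raising k puts more ordinance text in front of the model, which is why the context-window caveat below about chunk sizes applies.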
More detailed information is provided below.

Running the Llama-2 13B model takes 27 GB of GPU memory per node. Running it on the full national sample, after data parallelization, took 7 hours per node across a total of 50 NVIDIA Quadro RTX 8000 nodes.
The 'Sample_Data' Excel file in the raw data directory contains the list of all municipalities in our sample along with their source and unique identifiers. The file was created by merging the 2022 Census of Governments dataset with the metadata (name, state, zip code, website, address, etc.) for all scraped municipalities. As a result, please refer to the Census of Governments data dictionary here for variable definitions. We reference this file to scrape text and to format the filenames of ordinance text files. All scraped text is stored in a directory defined by the user as 'muni_text' in the config.yaml file.
Scraping Ordinance Text
The text for each ordinance must be scraped from one of the three following sources:
The 'Source' column in the Sample_Data Excel file indicates the source used for each municipality; use this column to identify where to scrape the text for each ordinance file.

Take caution when scraping tables. While American Legal Publishing and Municode store tables in HTML format (which can easily be scraped), Ordinance.com stores tables in image format. Since the LLMs available at the time of the study only handle text, we use Amazon Textract to extract text from images of tables from Ordinance.com.
Naming Convention and Source Reference
Use the following convention for naming text files. Note that variable names like 'UNIT_NAME' correspond to columns in the 'Sample Data' Excel file.

The naming format for the text files is: UNIT_NAME#FIPS_PLACE#State.txt
For example:

```
/muni_text
    /al
        alabaster#820#al.txt
        albertville#988#al.txt
    /ak
        anchorage#3000#ak.txt
        angoon#3440#ak.txt
```
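The naming convention above can be generated from the Sample_Data columns with a small helper. This is a sketch, not the project's scraping code; the lowercasing is inferred from the example filenames:

```python
def muni_filename(unit_name: str, fips_place: int, state: str) -> str:
    """Build an ordinance text filename: UNIT_NAME#FIPS_PLACE#State.txt.

    Based on the examples above, names and state codes appear lowercased.
    The arguments correspond to columns in the 'Sample Data' Excel file.
    """
    return f"{unit_name.lower()}#{fips_place}#{state.lower()}.txt"

# e.g. muni_filename("Alabaster", 820, "AL") under /muni_text/al/
```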
Llama-2 Based Models
To use Llama-2 based models, you must first register with Meta here. Once registered, please download the specific models used in the analysis from Hugging Face, which houses quantized models compatible with exllama. We use the following models: 1. Llama-2-13B-GPTQ under the 'main' branch.

We also use exllama to speed up processing time. Please download this as well and provide the appropriate path.
Chat GPT Based Models
For Chat GPT based models, please register for an account with OpenAI here and place your API key in config.yaml. We use the model 'gpt-4' for Chat GPT 4 and 'gpt-3.5-turbo' for Chat GPT 3.5; see here for a list of all available Chat GPT based models.
The code is split into three parts/folders: Pre Processing Code, Main Model Code, and Table and Figures Code.

The user must provide a set of paths in config.yaml:
- muni_text: path of the directory that contains all municipal text files.
- embeddings: path of the directory with all embedding files; embedding.py produces the files for this directory.
- exllama_path: path where exllama is stored.
- llama13b_path: path where the model for Llama-2 13B is stored.
- processed_data: path to the processed data directory.
- raw_data: path to the raw data directory.
- shape_files: path to the shape files directory.
- figures_path: path to the figures folder in results.
- tables_path: path to the tables folder in results.
The user can also input their OpenAI (Chat GPT) API key and the number of neighbors used to construct context. Be careful when adjusting the number of neighbors: most models used in our study (all except Chat GPT 4) have 4k tokens of context, so if you increase the number of neighbors (i.e., the number of relevant chunks of text), you must decrease the size of each text chunk in embedding.py.
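A config.yaml with these entries might look like the following. All paths are placeholders, and the num_neighbors value is illustrative, not a recommended setting:

```yaml
muni_text: /path/to/muni_text
embeddings: /path/to/embeddings
exllama_path: /path/to/exllama
llama13b_path: /path/to/Llama-2-13B-GPTQ
processed_data: /path/to/processed_data
raw_data: /path/to/raw_data
shape_files: /path/to/shape_files
figures_path: /path/to/results/figures
tables_path: /path/to/results/tables
openai_key: sk-...     # optional; only needed for Chat GPT based models
num_neighbors: 10      # optional; illustrative value, see the caveat above
```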
Pre-processing consists of three scripts:

- Download Shape Files.py: given a directory to store shape files under shape_files in config.yaml, this script downloads all relevant US Census shape files for counties, county subdivisions, places, urban areas, and water areas. We use the 2022 TIGER/Line shape files downloaded from here, but you may find the web interface here more helpful if manually downloading shape files.
- Make % Urban.py: calculates the percent overlap between the 2022 shape files for municipalities and the 2020 shape file for urban areas. It produces the Excel file urban_raw.xlsx in the raw data directory.
- Merge Raw Data.py: merges all raw housing/demographic data (American Community Survey data, Building Permits data, Urban Area, and MSA classifications) with our municipality identifier dataset Sample Data.xlsx. The resulting Excel file is Sample_Enriched.xlsx, located in processed data.

The embeddings and Q&A code (which runs Llama-2) require powerful graphics cards (we use the RTX 8000). Each of these scripts uses SLURM job arrays, which we use to parallelize the code across several graphics cards/processors. Note that these scripts can be adapted to use other forms of parallelization or to run serially.
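The data parallelization across a SLURM job array can be sketched as follows: each array task reads its SLURM_ARRAY_TASK_ID and processes only its slice of the municipalities. The variable names are illustrative, not the project's actual code:

```python
import os

def task_slice(items, task_id: int, num_tasks: int):
    """Return the subset of items assigned to one SLURM array task.

    A simple round-robin split: task i takes every num_tasks-th item
    starting at position i, so the slices are disjoint and cover items.
    """
    return [item for i, item in enumerate(items) if i % num_tasks == task_id]

# Inside a script run via `sbatch --array=0-49 ...`, one might do:
# task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
# my_munis = task_slice(all_munis, task_id, num_tasks=50)
```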
Embedding code (embeddings.py)

Below is an example SLURM job array request for the embeddings file, which parallelizes the code across two nodes:

```
sbatch --array=0-1 embed.sbatch
```
Your corresponding embed.sbatch batch file should be similar to the following, but please consult the staff at your High Performance Computing center for more advice:
```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00:00
#SBATCH --mem=4GB
#SBATCH --gres=gpu:rtx8000:1
#SBATCH --job-name=embed

module purge
singularity exec --nv \
  --bind /path/to/cache:$HOME/.cache \
  --overlay /path/to/pytorch-ext:/path/to/pytorch-ext:ro \
  /path/to/singularity-image.sif \
  /bin/bash -c "source /path/to/env.sh; python3 embedding.py"
```
Note: the above runs Python inside a Singularity container.
Q&A code (QA_Code_V3.py)

The QA Code is the question-and-answer code and the main hub for running the model. For the QA_Code_V3.py script, you must specify one or two system arguments:
The first argument is for choosing the model. Valid options are:
The second argument is only relevant when running the code on one node (i.e. not using a SLURM job array). This is an integer representing the sample size (number of munis) to use from the training sample (107 randomly selected munis from the Pioneer Institute study). Use this parameter for either testing the code (by running on small samples) or for comparing model performance to the Pioneer Institute.
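The two system arguments described above could be read with a pattern like this. It is a sketch of the interface, not the script's actual argument handling:

```python
import sys

def parse_args(argv):
    """Parse the model name and the optional training-sample size.

    argv[1] selects the model (e.g. '13b'); argv[2], if present, is the
    number of municipalities to draw from the training sample.
    """
    model = argv[1]
    sample_size = int(argv[2]) if len(argv) > 2 else None
    return model, sample_size

# In QA_Code_V3.py this would be invoked as parse_args(sys.argv).
```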
Example of a job array parallelized across 50 nodes, run with a single argument '13b' for Llama-2 13B:

```
sbatch --array=0-49 qacode.sbatch 13b
```

Example job run with one argument '13b' on one node (thus run only on the training sample):

```
sbatch qacode.sbatch 13b
```

Example job run with two arguments '13b' and '5' (thus run on a random sample of 5 municipalities from the training sample):

```
sbatch qacode.sbatch 13b 5
```
Your corresponding qacode.sbatch batch file should be similar to the following, but please consult the staff at your High Performance Computing center for more advice:
```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00:00
#SBATCH --mem=10GB
#SBATCH --gres=gpu:rtx8000:1
#SBATCH --job-name=qacode

module purge
singularity exec --nv \
  --bind /path/to/cache:$HOME/.cache \
  --overlay /path/to/pytorch-ext:/path/to/pytorch-ext:ro \
  /path/to/singularity-image.sif \
  /bin/bash -c "source /path/to/env.sh; python3 QA_Code_V3.py $1 $2"
```
Note: the above runs Python inside a Singularity container.
The QA Code also calls on three other scripts:

- helper_functionsV2.py: contains various functions used by the QA code, such as finding relevant text and querying the LLM.
- gpt_functions.py: to elicit structured responses (for example, binary responses) from Chat GPT, we use function calling; please see here for further details.
- korhelper.py: Kor allows us to map open-ended, unstructured Llama-2 responses to parseable answers like "Yes", "No", or "I don't know". We provide training data for Kor in the raw data directory under the Excel file Kor Training.xlsx.

A final script, clean_llama_output.py, finishes any remaining answer parsing from Kor and calculates the minimum minimum lot size and the mean minimum lot size for each municipality.
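The lot-size aggregation amounts to taking, per municipality, the smallest and the average of the district-level minimum lot sizes. A sketch with hypothetical data, not the actual clean_llama_output.py code:

```python
def summarize_min_lot_sizes(district_minimums: list[float]) -> dict:
    """Given the minimum lot size reported for each zoning district in a
    municipality, compute the minimum minimum (the smallest district
    requirement) and the mean minimum (the average across districts)."""
    return {
        "min_min_lot_size": min(district_minimums),
        "mean_min_lot_size": sum(district_minimums) / len(district_minimums),
    }

# Hypothetical municipality with three districts (lot sizes in sq. ft.):
# summarize_min_lot_sizes([5000.0, 10000.0, 15000.0])
```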
There are three scripts to make tables and figures:

- map_maker.py: loops over a few select cities and questions, creating maps of the regulatory environment in each area. Please ensure that you have populated a shape file folder and provided its path in config.yaml before running this script.
- Table 1.py: creates Table 1 from the paper, about our dataset coverage.
- Tables 2 and A2 and Figures Histogram and Cleveland Dot.py: as the title suggests, creates tables 2 and A2, plus the histogram and Cleveland dot plot figures.

Processed data consists of model output and a merged dataset with municipality characteristics. A folder for each model (llama-13b, llama-70b, gpt-35-turbo, and gpt-4) contains the model output in pickle format. The model output for Llama-2 based models is also available in Excel format as the output of clean_llama_output.py. Finally, various municipality characteristics used to make figures and tables are stored in Sample_Enriched.xlsx.
Information on Sample Data.xlsx is explained previously.

- Kor Training.xlsx: a collection of manually parsed raw Llama-2 output that serves as training data for Kor to parse further responses. To edit or add training data, simply take open-ended responses from Llama-2 and manually parse the answers.
- msa_crosswalk_mar_2020.xlsx: maps counties to their respective MSA. We use this to understand our MSA population coverage.
- Questions.xlsx: the list of questions used for the analysis, along with an ID for each question and a categorization of the question type (i.e., 'Binary' or 'Numerical'). You may add additional questions here if needed.
- bps_raw.xlsx: the Building Permits Survey data; we use the 2022 survey. For a data dictionary, see [here](https://www2.census.gov/econ/bps/Documentation/placeasc.pdf).
- training.pickle: a list of the random sample of 107 municipalities used in the Pioneer Institute study. We use this cut of the dataset to check model performance as we iterate. We do not test performance on the remaining municipalities so that we retain data on which we can fine-tune and test after improvements have been made.
- ACS Data: a folder with all American Community Survey variables relevant to the analysis.
- 2022 Population Data: MSA- and state-level population data for 2022.

The results directory holds the results as Excel-file tables in the tables folder and as images in the figures folder. All tables and figures in this directory are generated from code found in code -> Table and Figures Code.
Contact Info

Please contact dm4766@stern.nyu.edu with any inquiries.
License
This project is licensed under the MIT License - see the LICENSE file for details.