Radiance-Technologies / prism

Utilities for creating and curating a dataset for automated proof repair in the Coq Proof Assistant.
GNU Lesser General Public License v3.0

Distribution Statement A (Approved for Public Release, Distribution Unlimited).

PRISM

Code for the paper Proof Repair Infrastructure for Supervised Models: Building a Large Proof Repair Dataset.

Please cite as

@inproceedings{reichel_proof_2023,
    address = {Dagstuhl, Germany},
    series = {Leibniz {International} {Proceedings} in {Informatics} ({LIPIcs})},
    title = {Proof {Repair} {Infrastructure} for {Supervised} {Models}: {Building} a {Large} {Proof} {Repair} {Dataset}},
    volume = {268},
    copyright = {All rights reserved},
    isbn = {978-3-95977-284-6},
    shorttitle = {Proof {Repair} {Infrastructure} for {Supervised} {Models}},
    url = {https://drops.dagstuhl.de/opus/volltexte/2023/18401},
    doi = {10.4230/LIPIcs.ITP.2023.26},
    urldate = {2023-07-27},
    booktitle = {14th {International} {Conference} on {Interactive} {Theorem} {Proving} ({ITP} 2023)},
    publisher = {Schloss Dagstuhl – Leibniz-Zentrum für Informatik},
    author = {Reichel, Tom and Henderson, R. Wesley and Touchet, Andrew and Gardner, Andrew and Ringer, Talia},
    editor = {Naumowicz, Adam and Thiemann, René},
    year = {2023},
    note = {ISSN: 1868-8969},
    keywords = {machine learning, proof repair, benchmarks, datasets, formal proof},
    pages = {26:1--26:20},
}

Description

PRISM is a project management and data collection framework for the Coq Proof Assistant geared towards enabling creation and curation of AI/ML datasets for proof engineering tasks including proof repair. Major subcomponents include virtual environment management through opam switches, proof extraction utilities, and repair mining functions.

Getting Started

Prerequisites

PRISM only supports Unix hosts and has only been tested on Ubuntu 20.04 and Ubuntu 22.04, so this guide assumes access to a Linux terminal. In addition, the following system packages or libraries must be installed:

Workspace/Remote Environment Setup

Integrated Development Environment Setup

The recommended IDE is Visual Studio Code with the following minimal set of extensions installed and enabled:

In addition, we recommend setting the following variables in your settings JSON: "editor.rulers": [72, 80, 88] and "python.defaultInterpreterPath": "path/to/prism/venv-X.Y.Z/bin/pythonX.Y", with the latter modified to match the location of your preferred Python interpreter.
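For reference, these recommendations correspond to a settings JSON fragment like the following (the interpreter path is a placeholder and must be adjusted to match your environment and Python version):

```json
{
    "editor.rulers": [72, 80, 88],
    "python.defaultInterpreterPath": "path/to/prism/venv-X.Y.Z/bin/pythonX.Y"
}
```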

A Brief Guide to the Source Code

Docstrings are generally expected to give a sufficient explanation of contents and functionality within PRISM source code. Users are encouraged to explore on their own. Some highlights are listed below:

Scripts

Standalone scripts that provide both core and superficial utilities are housed in the scripts directory. Note that these scripts lie outside CI and are not necessarily maintained.

Some important scripts are highlighted here.

Performing repair mining

The primary purpose of PRISM is to enable mining of proof repairs from Git repositories. The following steps are recommended for repair mining (after following the directions in Getting Started). The process is divided into two phases: cache extraction and repair mining.

Cache Extraction

In the cache extraction phase, we iterate over the commits of one or more projects and independently attempt to build them before extracting line-by-line context. The ultimate output for a given commit is a ProjectCommitData. The cache files are stored in a directory structure defined by the CoqProjectBuildCacheProtocol. The extract_cache.py script is the main entrypoint for cache extraction and will handle creation of a base pool of switches for a SwiM to use. Arguments to the script allow one to limit extraction to a subset of projects, Coq versions, commits, or files. See the script's help message for more details.
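The per-commit extraction flow described above can be sketched as follows. This is a schematic only, with hypothetical function names, not prism's actual API; it omits switch management by the SwiM and the real cache directory layout.

```python
# Schematic of the cache-extraction loop: for each (project, commit),
# attempt a build, then extract context and write a cache entry.
# All names here are illustrative stand-ins, not prism's actual API.

def extract_caches(projects, build, extract, write_cache):
    """Iterate commits per project; record build failures and successes."""
    results = {}
    for project, commits in projects.items():
        for commit in commits:
            try:
                artifacts = build(project, commit)
            except RuntimeError:
                # A failed build is recorded, not fatal to the whole run.
                results[(project, commit)] = "build error"
                continue
            cache_entry = extract(artifacts)  # ProjectCommitData analogue
            write_cache(project, commit, cache_entry)
            results[(project, commit)] = "success"
    return results
```

Each commit is handled independently, so a failure on one commit does not prevent extraction of the others.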

Builds are attempted for supported Coq versions in reverse order. Extraction of a large number of projects or commits can take a very long time, so it is recommended that one use a terminal multiplexer such as tmux. Progress can be monitored in a separate shell with the monitor-caching.py script, which shows the number of successful and failed extractions. Do not be alarmed by a large number of failures (shown as errors), as some commits simply cannot be extracted.

Failures are grouped into three categories: build errors, cache errors, and miscellaneous errors. These errors may be treated as symptoms; determining the underlying cause of a failure is a manual procedure. Build errors indicate that something went wrong during a build process and may indicate either a buggy project commit or incorrect metadata that could not be repaired by built-in inference mechanisms. Cache errors imply a bug in the extraction code itself, either from a flawed assertion/assumption or some other logical error. Cache errors are the only category indicative of something unambiguously wrong, but each such error generally impacts only a subset of commits. Miscellaneous errors are captured exceptions that do not fall within the previous two categories. Failure to obtain a switch from the SwiM due to unsatisfiable dependencies is a common miscellaneous error, which may or may not indicate flawed metadata.
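The triage described above amounts to mapping a captured exception onto one of the three categories. The exception classes below are hypothetical stand-ins used only to illustrate the taxonomy:

```python
# Illustrative triage of extraction failures into the three categories
# described above. BuildError and CacheError are made-up stand-ins for
# whatever exception types an extraction run actually raises.

class BuildError(Exception):
    """Stand-in for a failure during the project build."""

class CacheError(Exception):
    """Stand-in for a bug in the extraction code itself."""

def categorize(exc):
    """Map a captured exception to a failure category."""
    if isinstance(exc, BuildError):
        return "build error"
    if isinstance(exc, CacheError):
        return "cache error"
    # Everything else (e.g., unsatisfiable switch dependencies) is
    # miscellaneous and requires manual inspection.
    return "miscellaneous error"
```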

If extraction of a project commit encountered an error, one can open the error log in the cache to obtain a stack trace, which may also identify the Coq file and line at which the error occurred. Running extraction limited to just the project, commit, Coq version, and file enables one to use a debugger to more precisely identify the cause and possible fixes.

Extraction uses multiprocessing across projects and files, so be sure that the number of workers chosen is appropriate for your available resources. In addition, be prepared for significant disk space consumption from the cloned switches created by the SwiM. Finally, note that even though --max-proj-build-runtime X will abort an extraction if the project takes more than X seconds to build, a timeout on the actual extraction of commands via line-by-line execution may last longer.
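One simple way to pick a worker count that respects both caveats above is to bound it by CPU count and by a memory budget. The per-worker memory figure below is an illustrative assumption, not a measurement from prism:

```python
# Sketch: cap the number of multiprocessing workers by both the CPU
# count and an assumed per-worker memory footprint. The 4 GB default
# is an illustrative guess; profile your own workload to tune it.
import os

def choose_workers(available_mem_gb, per_worker_gb=4.0, cpu_count=None):
    """Return a worker count bounded by CPU and memory budgets."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    by_memory = max(1, int(available_mem_gb // per_worker_gb))
    return max(1, min(cpu_count, by_memory))
```

On a memory-constrained machine the memory bound dominates; on a memory-rich machine the CPU count does.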

Repair Mining

In the repair mining phase, we consider pairs of extracted commits and mine repairs from their changes. The ultimate output is a collection of pairs of files, each storing a serialized GitRepairInstance and ProjectCommitDataRepairInstance, respectively. The mine_repair_instances.py script is the main entrypoint for repair mining and takes a configuration file as its only argument. A sample configuration is provided by mine_repair_instances. Note that one should adjust the cache_root, repair_instance_db_directory, and metadata_storage_file to match the working directory when the repair mining script is called. In order to prevent successive repair mining runs from clobbering one another, especially when there have been changes to prism, it is recommended that one first generate a configuration with produce_config.py, which will generate a base configuration file with the current commit as the name of the root directory. The configuration file simply stores a JSON dictionary containing keyword arguments to the repair_mining_loop, where one can find a description for each.
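A configuration file of the kind described above might look like the following. Only the key names cache_root, repair_instance_db_directory, metadata_storage_file, and fast come from this document; the values are placeholders, and a real generated configuration will contain additional keyword arguments to repair_mining_loop:

```json
{
    "cache_root": "./caches",
    "repair_instance_db_directory": "./repair-instances",
    "metadata_storage_file": "./metadata.yml",
    "fast": true
}
```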

Repair mining involves multiprocessing and loading large files into memory, so be sure that the number of workers chosen is appropriate for your available resources. Choosing too many workers may result in difficult-to-debug out-of-memory errors. Note that fast is enabled by default in generated configurations, which strips the loaded caches (and resulting ProjectCommitDataRepairInstances) of ASTs and other data that is expensive to process and store.

One may wish to perform repair mining on (disjoint) caches on one or more computers. Repair instance databases produced in this manner can be merged after the fact using the merge_repair_instance_dbs.py script, which will deduplicate any repairs in their intersection.
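Conceptually, the merge performed by merge_repair_instance_dbs.py is a union that drops duplicates in the intersection. A minimal sketch, modeling each database as a dict keyed by a repair's identity (this is an illustrative model, not prism's on-disk format):

```python
# Sketch of merging repair-instance databases with deduplication.
# A "database" here is modeled as a dict mapping a repair's identity
# (e.g., a content hash) to the repair instance itself.

def merge_dbs(*dbs):
    """Union several databases; duplicates keep the first copy seen."""
    merged = {}
    for db in dbs:
        for key, instance in db.items():
            merged.setdefault(key, instance)  # skip keys already present
    return merged
```

Because deduplication is keyed on identity, the merge is order-insensitive in size: repairs in the intersection appear exactly once.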

Some basic summary statistics can be collected using the filter_repair_examples.py script. For example, one may count the total number of repair examples in a database with a given (set of) tag(s). Tags of particular importance include repair:proof and repair:specification. The former indicates a repair to the body of a proof, whereas the latter indicates a repair to a lemma/conjecture itself. Cataloguing repairs based on the type of Vernacular command (e.g., VernacStartTheoremProof) can be done using tags of the form repair:[CommandTypeName], e.g., repair:VernacDefinition.
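The counting described above reduces to a tag-set containment check. A minimal sketch, with made-up example records (the tag strings follow the conventions named in this section, but the helper is not prism's actual API):

```python
# Illustrative tag-based counting analogous to filter_repair_examples.py.
# Each example is represented only by its collection of tag strings.

def count_with_tags(examples, required_tags):
    """Count examples whose tags include every required tag."""
    required = set(required_tags)
    return sum(1 for tags in examples if required <= set(tags))
```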

Acknowledgements

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement HR00112290067.

The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

We also want to acknowledge and thank Dr. Talia Ringer and Thomas (Tom) Reichel of the University of Illinois at Urbana-Champaign for their valuable insights and contributions to this work. Talia championed the ITP 2023 paper and the overall importance of the endeavor, and Tom implemented solutions to multiple critical issues including but not limited to switch cloning, sequence alignment with numba, and dependency inference. Without their help, this work would be diminished.