Code for the paper *Proof Repair Infrastructure for Supervised Models: Building a Large Proof Repair Dataset*.
Please cite as:

```bibtex
@inproceedings{reichel_proof_2023,
  address = {Dagstuhl, Germany},
  series = {Leibniz {International} {Proceedings} in {Informatics} ({LIPIcs})},
  title = {Proof {Repair} {Infrastructure} for {Supervised} {Models}: {Building} a {Large} {Proof} {Repair} {Dataset}},
  volume = {268},
  copyright = {All rights reserved},
  isbn = {978-3-95977-284-6},
  shorttitle = {Proof {Repair} {Infrastructure} for {Supervised} {Models}},
  url = {https://drops.dagstuhl.de/opus/volltexte/2023/18401},
  doi = {10.4230/LIPIcs.ITP.2023.26},
  urldate = {2023-07-27},
  booktitle = {14th {International} {Conference} on {Interactive} {Theorem} {Proving} ({ITP} 2023)},
  publisher = {Schloss Dagstuhl – Leibniz-Zentrum für Informatik},
  author = {Reichel, Tom and Henderson, R. Wesley and Touchet, Andrew and Gardner, Andrew and Ringer, Talia},
  editor = {Naumowicz, Adam and Thiemann, René},
  year = {2023},
  note = {ISSN: 1868-8969},
  keywords = {machine learning, proof repair, benchmarks, datasets, formal proof},
  pages = {26:1--26:20},
}
```
PRISM is a project management and data collection framework for the Coq Proof Assistant, geared toward enabling the creation and curation of AI/ML datasets for proof engineering tasks, including proof repair. Major subcomponents include virtual environment management through `opam` switches, proof extraction utilities, and repair mining functions.
PRISM only supports Unix hosts and has only been tested on Ubuntu 20.04 and Ubuntu 22.04, so this guide assumes access to a Linux terminal. In addition, a number of system packages and libraries must be installed. Then:

1. Navigate to the `prism` directory (the root of the repository).
2. Create a Python virtual environment: `source setup_python.sh [Python 3 version (optional)]`.
3. Run `opam init` (required to successfully install Coq).
4. Install Coq: `source setup_coq.sh [Coq version number (optional)]`.
5. Activate the virtual environment (created by `setup_python.sh` above).
6. Install PRISM with `make install` (with the switch created by `setup_coq.sh` active). As a side-effect, this command will install and upgrade all PRISM installation dependencies, equivalent to `pip install -Ur requirements.txt`.
7. Verify the installation: `make test`.
8. Set up the development environment: `make dev`.
Pre-commit hooks are automatically applied each time you try to commit changes and consist
of a series of tests that must each pass before the commit is added to the repository's index.
Some of these hooks may modify files (e.g., applying code styles).
Any changes induced by the hooks must be manually staged for commit via `git add` before attempting to `git commit` again.
Pre-commit hooks may be skipped by adding the `--no-verify` option to `git commit`.
However, skipping the hooks is generally discouraged since it merely delays the inevitable;
server-side pre-commit hooks will be applied in any branch for which a merge request has been
created and force pipeline failures (and thus prevent merges) until the hooks are satisfied.
Only an administrator, project owner, or maintainer can bypass these server-side hooks when
the target branch is protected.
Visual Studio Code with a minimal set of extensions installed and enabled is the recommended IDE.
Configure `yapf` and `flake8` as the formatting and linting providers, respectively, with linting set to run upon saving of a file.
A file may be manually formatted by opening the command palette with F1 and selecting `Format Document`.
Imports may also be sorted through the `Python Refactor: Sort Imports` command in the command palette.
Static type checking should also be enabled with `mypy`.
Docstrings should follow the `numpy` format, use the `"""` quote style, start on a new line, and not include the function name at the start of the docstring.
In addition, it is recommended to set `"editor.rulers": [72, 80, 88]` and `"python.defaultInterpreterPath": "path/to/prism/venv-X.Y.Z/bin/pythonX.Y"` in the settings JSON, with the latter modified to match the location of the preferred Python interpreter.
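For reference, the recommended settings above would appear in a VS Code `settings.json` roughly as follows (the interpreter path is a placeholder to be adjusted):

```json
{
    "editor.rulers": [72, 80, 88],
    "python.defaultInterpreterPath": "path/to/prism/venv-X.Y.Z/bin/pythonX.Y"
}
```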
Docstrings are generally expected to give a sufficient explanation of contents and functionality within PRISM source code. Users are encouraged to explore on their own. Some highlights are listed below:
The `Project` class is the central figure of PRISM, integrating operations including building, debugging, and dependency management.
`Project` includes methods for inferring metadata, enumerating project files, and parsing Coq sentences from project files.
In practice, one will likely use `ProjectRepo` objects to work with Coq projects.
Metadata associated with a `Project`, including build commands, dependencies, and options for line-by-line execution with SerAPI, is managed through a `MetadataStorage` object, which can enumerate the stored metadata and perform insertion, update, and retrieval operations.
A serialized example covering the projects in the dataset can be loaded with `MetadataStorage.load`.

Based on the `SerAPI` class of CoqGym, PRISM's `SerAPI` class wraps the `sertop` executable of `coq-serapi` and provides tested support for seven minor versions of Coq, ranging from 8.9 to 8.15.
Building on `SerAPI`, the `CommandExtractor` tracks document context and aggregates proofs into objects with abstract syntax trees (ASTs), goals, hypotheses, and fully qualified identifiers; it also provides more straightforward rollback mechanisms than `SerAPI`.
The extracted data for a given commit is captured in a `ProjectCommitData`, which captures the build environment for reproduction in addition to the comments and commands of each file.
A `ProjectCommitData` can be dumped to file using `ProjectCommitData.write_coq_project` to reproduce the raw text of the commit's Coq files.

The `RepairInstance` class provides the core structure of a repair example as an erroneous project state (an `ErrorInstance`) paired with a repair.
Two concrete subclasses, `GitRepairInstance` and `ProjectCommitDataRepairInstance`, specialize the base class to Git diffs with commit SHAs and to `ProjectCommitDataDiff`s with `ProjectCommitData`s, respectively.
The former provide standalone representations of repairs, whereas the latter require `prism` to operate efficiently.
Methods to construct each are provided.
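As a rough illustration of this structure, a repair instance pairs an erroneous project state with a fix. The sketch below uses plain dataclasses with hypothetical field names; it is not the actual `prism` API:

```python
from dataclasses import dataclass


@dataclass
class ErrorInstance:
    """An erroneous project state (illustrative fields only)."""

    project_name: str
    initial_commit_sha: str
    change: str  # a Git diff that introduces the error


@dataclass
class RepairInstance:
    """An erroneous project state paired with a repair."""

    error: ErrorInstance
    repair: str  # a Git diff that fixes the erroneous state


# A Git-based instance can stand alone: SHAs and diffs identify everything.
broken = ErrorInstance("example-project", "deadbeef", "- Qed.\n+ Admitted.")
fixed = RepairInstance(broken, "- Admitted.\n+ Qed.")
assert fixed.error.project_name == "example-project"
```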
PRISM also provides an interface to `opam`, enabling the creation, deletion, and cloning of switches.
Note that cloned switches can only be used through the Python API (namely the `OpamSwitch` class), as `opam` is ignorant of them.
An `OpamSwitch` represents an `opam` switch, or virtual environment/sandbox, in which a project can be built in isolation; it enables the installation and removal of dependencies.
Note that if an `OpamSwitch` represents a clone, then the `opam`-managed original switch must not be deleted until the clone is no longer needed.
Several implementations of the switch manager (`SwitchManager`, or SwiM) are provided, but the `AutoSwitchManager` is the easiest to use and implements the features described above.
**Warning**: The SwiM can easily and quickly consume disk space.
To clear the auto-generated switch clones, run `rm -rf /path/to/your/opam/root/*_clone_*`.
**Warning**: The SwiM assumes complete control of the `opam` root.
It is not safe to perform normal `opam` operations or to operate two SwiMs at once on the same root directory.
The `SharedSwitchManager` provides a multiprocessing-safe implementation.

Standalone scripts that provide both core and superficial utilities are housed in the `scripts` directory.
Note that these scripts lie outside CI and are not necessarily maintained.
Some important scripts are highlighted here.
The primary purpose of PRISM is to enable mining of proof repairs from Git repositories. The following steps are recommended to perform repair mining (after following the directions in Getting Started). The process is divided into two phases: cache extraction and repair mining.
In the cache extraction phase, we iterate over the commits of one or more projects and independently attempt to build them before extracting line-by-line context.
The ultimate output for a given commit is a `ProjectCommitData`.
The cache files are stored in a directory structure defined by the `CoqProjectBuildCacheProtocol`.
The `extract_cache.py` script is the main entrypoint for cache extraction and will handle creation of a base pool of switches for a SwiM to use.
Arguments to the script allow one to limit extraction to a subset of projects, Coq versions, commits, or files; see the script's help message for more details.
Builds are attempted for supported Coq versions in reverse order.
Extraction of a large number of projects or commits can take a very long time, so it is recommended that one use a terminal multiplexer such as `tmux`.
Progress can be monitored in a separate shell with `monitor-caching.py`, which will show the number of successful and failed extractions.
Do not be alarmed by a large number of failures (shown as errors), as some commits simply cannot be extracted.
Failures are grouped into three categories: build errors, cache errors, and miscellaneous errors. These errors may be treated as symptoms; determination of the underlying cause of failure is a manual procedure. Build errors indicate that something went wrong during a build process and may indicate either a buggy project commit or incorrect metadata that could not be repaired by built-in inference mechanisms. Cache errors imply a bug in the extraction code itself, either from a flawed assertion/assumption or some other logical error. Cache errors are the only category indicative of something unambiguously wrong, but each such error generally impacts only a subset of commits. Miscellaneous errors are captured exceptions that do not fall within the previous two categories. Failure to obtain a switch from the SwiM due to unsatisfiable dependencies is a common miscellaneous error, which may or may not be indicative of flawed metadata.
If extraction of a project commit encountered an error, one can open the error log in the cache to obtain a stack trace, which may also identify the Coq file and line at which the error occurred. Running extraction limited to just the project, commit, Coq version, and file enables one to use a debugger to more precisely identify the cause and possible fixes.
Extraction contains multiprocessing components across projects and files, so be sure that the number of workers chosen is appropriate for your available resources.
In addition, be prepared for significant consumption of disk space by cloned switches created by the SwiM.
Finally, note that even though `--max-proj-build-runtime X` will abort an extraction if the project takes more than `X` seconds to build, a timeout on the actual extraction of commands via line-by-line execution may last longer.
In the repair mining phase, we consider pairs of extracted commits and mine repairs from their changes.
The ultimate output is a collection of pairs of files, each pair storing a serialized `GitRepairInstance` and a serialized `ProjectCommitDataRepairInstance`, respectively.
The `mine_repair_instances.py` script is the main entrypoint for repair mining and takes a configuration file as its only argument.
A sample configuration is provided by `mine_repair_instances`.
Note that one should adjust `cache_root`, `repair_instance_db_directory`, and `metadata_storage_file` to match the working directory when the repair mining script is called.
To prevent successive repair mining runs from clobbering one another, especially when there have been changes to `prism`, it is recommended that one first generate a configuration with `produce_config.py`, which will generate a base configuration file with the current commit as the name of the root directory.
The configuration file simply stores a JSON dictionary containing keyword arguments to the `repair_mining_loop`, where one can find a description of each.
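For orientation, such a configuration file might look roughly like the following; the paths are placeholders, and all other keyword arguments accepted by the `repair_mining_loop` are omitted:

```json
{
    "cache_root": "path/to/cache",
    "repair_instance_db_directory": "path/to/repair-instance-db",
    "metadata_storage_file": "path/to/metadata-storage-file",
    "fast": true
}
```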
Repair mining contains multiprocessing components and loads large files into memory, so be sure that the number of workers chosen is appropriate for your available resources.
Choosing too many workers may result in difficult-to-debug out-of-memory errors.
Note that `fast` is enabled by default in generated configurations, which strips the loaded caches (and resulting `ProjectCommitDataRepairInstance`s) of ASTs and other expensive-to-process/store data.
One may wish to perform repair mining on (disjoint) caches on one or more computers.
Repair instance databases produced in this manner can be merged after the fact using the `merge_repair_instance_dbs.py` script, which will deduplicate any repairs in their intersection.
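The deduplication step can be sketched as follows; `merge_databases` is a hypothetical stand-in operating on in-memory lists of serialized instances, not the actual script's interface:

```python
import hashlib
import json


def merge_databases(*databases):
    """Merge repair databases, keeping one copy of duplicate entries.

    Hypothetical stand-in for merge_repair_instance_dbs.py: each
    database is a list of JSON-serializable repair instances, and
    entries are deduplicated by a hash of their canonical form.
    """
    merged = []
    seen = set()
    for database in databases:
        for instance in database:
            # Hash a canonical serialization so identical repairs mined
            # on different machines collapse to a single entry.
            key = hashlib.sha256(
                json.dumps(instance, sort_keys=True).encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                merged.append(instance)
    return merged


db_a = [{"tags": ["repair:proof"], "diff": "..."}]
db_b = [{"tags": ["repair:proof"], "diff": "..."},
        {"tags": ["repair:specification"], "diff": "..."}]
merged = merge_databases(db_a, db_b)
assert len(merged) == 2  # the shared repair appears only once
```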
Some basic summary statistics can be collected using the `filter_repair_examples.py` script.
For example, one may count the total number of repair examples in a database with a given (set of) tag(s).
Tags of particular importance include `repair:proof` and `repair:specification`.
The former indicates a repair to the body of a proof, whereas the latter indicates a repair to a lemma/conjecture itself.
Cataloguing repairs based on the type of Vernacular command (e.g., `VernacStartTheoremProof`) can be done using tags of the form `repair:[CommandTypeName]`, e.g., `repair:VernacDefinition`.
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement HR00112290067.
The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
We also want to acknowledge and thank Dr. Talia Ringer and Thomas (Tom) Reichel of the
University of Illinois at Urbana-Champaign for their valuable insights and contributions
to this work.
Talia championed the ITP 2023 paper and the overall importance of the endeavor, and
Tom implemented solutions to multiple critical issues including but not
limited to switch cloning, sequence alignment with `numba`, and dependency inference.
Without their help, this work would be diminished.