Dataset requests - Githubissues

LasseBlaauwbroek commented 2 years ago

If you'd like any information to be added to the dataset that is not already there please list it here.

[ ] Fix printing of existential variables, with a proper sigma.
[x] Add the text of tactics without anonimyzing postprocessing
[x] Add version id to dataset
[ ] Full terms in arguments
[x] De-duplication of the graph
[ ] In addition to the decomposed tactics, also include all of the non-decomposed tactics and partially decomposed tactics to the proof graph.
[ ] Add the grammar (AST) of a tactic as a graph to the dataset
[ ] Some more metadata about constants (remove if we have this already) (Jason):
- [ ] Reference to coq file location
- [x] Text of type
- [x] Text of body
[x] If not already added, the variable name for a context element (Jason).
[x] Better documentation of the specification (Jason)
- a doc string for each item in the specification
- (If the doc strings show up in the python doc strings for the accessors, then use the official doc string format of capnp, where the doc string is below the item. If it doesn't matter, then it could go above or below.)
[ ] Some scripts which extract the following information in CSV, JSON, or JSON lines. (I've written my own in the past for some of these. I could continue to do that, I just worry about the specification changing or more data being added.) (Jason)
- [ ] Table of all definitions with lots of metadata
- [ ] Table of all proof states and their tactics with lots of metadata
- [ ] Same as above but also listing the edges and nodes in the respective graphs.

jasonrute commented 2 years ago

This is more on Vasily's side, but a way to turn the id we use in the dataloader into the id in the capnp dataset (again, if not already existing).

LasseBlaauwbroek commented 2 years ago

If not already added, the variable name for a context element (Jason).

Already existed.

Text of type

Done

Text of body

Done

Better documentation of the specification (Jason)

Done. Feel free to make further improvements.

Some scripts which extract the following information in CSV, JSON, or JSON lines.

This can of course be written, but for me these kind of things are usually throwaway scripts. You can write these kinds of things in +/- 15 lines of python code. Everyone has slightly different needs from such scripts, and I don't want to write scripts with a million options for this. So I don't think I'll be writing these. But if someone else wants to do it, that is fine with me.

Reference to coq file location

This is something that is possible, but not trivial. I'm also interested in this myself, but not really for any machine learning purposes, but rather for proof engineering purposes. It seems to me that our dataset could also be very useful to folks that are engaged with that (see also https://proofengineering.org/). However, I don't immediately see any machine learning application for this, and as such I'm not regarding it as a high priority. What kind of usecase do you have in mind? What kind of precision do you need? Do you only need location information for definitions or do you also need the locations of tactics?

jasonrute commented 2 years ago

The Coq file reference isn't a top priority. The reason I suggested it is that often I'm looking at coq data, and I wonder to myself, what does the proof or definition actually look like (as well as some of the surronding stuff). Right now I just google Coq standard library <NAME>. I can continue to use Google for now (or better find the actual repo and then use Github search).

jasonrute commented 2 years ago

As for scripts, that is fine. I'll write my own. (Jelle and I were talking recently about some data it would be nice to have extracted.) Where is a good place for me to save the scripts? I can keep them to myself of course, or share them. If I share them, is there a place in your repo, or should I put them in graph2tac?

LasseBlaauwbroek commented 2 years ago

I guess it is a good idea to put the scripts in this repo because then they are synchronized with the API. Maybe just put them in a sub-directory of pytact: https://github.com/coq-tactician/coq-tactician-reinforce/tree/coq8.11/pytact We can always reorganize this later. But please keep the scripts simple and general. The idea is not to put an entire neural network in this repo.

If you want me to update your scripts when the API changes I'll ask you to either (1) use a strongly typed language or (2) provide easy-to-run tests. If not, that is fine too, but then I'd like you to update them yourself.

LasseBlaauwbroek commented 2 years ago

The source for the current datasets is here https://github.com/coq/coq/tree/V8.11.2/theories and here https://github.com/coq/coq/tree/V8.11.2/plugins by the way (mind the tag).

I can distribute the source files with the dataset I guess. And for this dataset that may not be a problem. But when we grow the dataset, we'll have to figure out how to deal with licenses. How can we license our dataset? Is the dataset transformative? Is it fair use? Which copyright jurisdiction even applies here? Certainly we cannot re-license the source files so those we will have to license as they were originally, unless that license allows re-licensing... It's a mess. The free software foundation has a whole call-for-opinions on this topic w.r.t Copilot: https://www.fsf.org/blogs/licensing/fsf-funded-call-for-white-papers-on-philosophical-and-legal-questions-around-copilot

coq-tactician / coq-tactician-api

Dataset requests #8