coq-tactician / coq-tactician-api

An API for interfacing with Coq through Tactician by external agents
https://coq-tactician.github.io/api/introduction
MIT License
2 stars 1 forks source link

Dataset requests #8

Open LasseBlaauwbroek opened 2 years ago

LasseBlaauwbroek commented 2 years ago

If you'd like any information to be added to the dataset that is not already there please list it here.

jasonrute commented 2 years ago

This is more on Vasily's side, but a way to turn the id we use in the dataloader into the id in the capnp dataset (again, if not already existing).

LasseBlaauwbroek commented 2 years ago

If not already added, the variable name for a context element (Jason).

Already existed.

Text of type

Done

Text of body

Done

Better documentation of the specification (Jason)

Done. Feel free to make further improvements.

Some scripts which extract the following information in CSV, JSON, or JSON lines.

This can of course be written, but for me these kind of things are usually throwaway scripts. You can write these kinds of things in +/- 15 lines of python code. Everyone has slightly different needs from such scripts, and I don't want to write scripts with a million options for this. So I don't think I'll be writing these. But if someone else wants to do it, that is fine with me.

Reference to coq file location

This is something that is possible, but not trivial. I'm also interested in this myself, but not really for any machine learning purposes, but rather for proof engineering purposes. It seems to me that our dataset could also be very useful to folks that are engaged with that (see also https://proofengineering.org/). However, I don't immediately see any machine learning application for this, and as such I'm not regarding it as a high priority. What kind of usecase do you have in mind? What kind of precision do you need? Do you only need location information for definitions or do you also need the locations of tactics?

jasonrute commented 2 years ago

The Coq file reference isn't a top priority. The reason I suggested it is that often I'm looking at coq data, and I wonder to myself, what does the proof or definition actually look like (as well as some of the surronding stuff). Right now I just google Coq standard library <NAME>. I can continue to use Google for now (or better find the actual repo and then use Github search).

jasonrute commented 2 years ago

As for scripts, that is fine. I'll write my own. (Jelle and I were talking recently about some data it would be nice to have extracted.) Where is a good place for me to save the scripts? I can keep them to myself of course, or share them. If I share them, is there a place in your repo, or should I put them in graph2tac?

LasseBlaauwbroek commented 2 years ago

I guess it is a good idea to put the scripts in this repo because then they are synchronized with the API. Maybe just put them in a sub-directory of pytact: https://github.com/coq-tactician/coq-tactician-reinforce/tree/coq8.11/pytact We can always reorganize this later. But please keep the scripts simple and general. The idea is not to put an entire neural network in this repo.

If you want me to update your scripts when the API changes I'll ask you to either (1) use a strongly typed language or (2) provide easy-to-run tests. If not, that is fine too, but then I'd like you to update them yourself.

LasseBlaauwbroek commented 2 years ago

The source for the current datasets is here https://github.com/coq/coq/tree/V8.11.2/theories and here https://github.com/coq/coq/tree/V8.11.2/plugins by the way (mind the tag).

I can distribute the source files with the dataset I guess. And for this dataset that may not be a problem. But when we grow the dataset, we'll have to figure out how to deal with licenses. How can we license our dataset? Is the dataset transformative? Is it fair use? Which copyright jurisdiction even applies here? Certainly we cannot re-license the source files so those we will have to license as they were originally, unless that license allows re-licensing... It's a mess. The free software foundation has a whole call-for-opinions on this topic w.r.t Copilot: https://www.fsf.org/blogs/licensing/fsf-funded-call-for-white-papers-on-philosophical-and-legal-questions-around-copilot