data61 / PSL


Neural_PaMpeR: build the database again with each line tagged with proof obligations. #178

Open · yutakang opened 3 years ago

brando90 commented 3 years ago

hi @yutakang, is this the data set we spoke about? Can I help you make it?

yutakang commented 3 years ago

Hi @brando90,

> hi @yutakang, is this the data set we spoke about?

Yes. I created a new directory Neural_PaMpeR at https://github.com/data61/PSL/tree/master/Neural_PaMpeR

Bits and pieces of the code base for the data extraction for PaMpeR are outdated for Isabelle2020. So, I have to update them.

And unfortunately I did not record on which date I downloaded the AFP articles to build the database I used for the CICM paper.

So, I have to build the database again, so that I can match each line of the database with the string representing each proof goal.

I can use a big server machine in Innsbruck, so in terms of computational resources I think it's alright.

brando90 commented 3 years ago


ok cool, let me know if there is something I can do to help get the proof obligations.

Thanks!

yutakang commented 3 years ago

Hi @brando90 ,

I produced a sample datapoint for this lemma.

This sample datapoint is expressed in 4 lines:

  • 1 line for the results of the assertions,
  • 3 lines for the strings of the proof obligations (1 for the first subgoal and 2 for the local assumptions, a.k.a. chained facts).

Each line representing this datapoint starts with the location information (file name and line number). The lines representing first subgoals are tagged with "First_Subgoal", and the lines representing chained facts are tagged with "Chained_Fact".
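
For concreteness, here is a rough Python sketch of how one such 4-line datapoint might be parsed. The tab-separated layout, the "Assertions" tag, and the sample payloads are just my guesses for illustration; only the "First_Subgoal" and "Chained_Fact" tags come from the description above.

```python
# Hypothetical layout of one 4-line datapoint (made up for illustration):
#   <file_name> TAB <line_number> TAB <tag> TAB <payload>
SAMPLE = (
    "Example.thy\t42\tAssertions\t0,1,0,1\n"
    "Example.thy\t42\tFirst_Subgoal\tx + y = y + x\n"
    "Example.thy\t42\tChained_Fact\tx : nat\n"
    "Example.thy\t42\tChained_Fact\ty : nat\n"
)

def parse_datapoint(block: str) -> dict:
    """Group the four lines of one datapoint by their tag."""
    point = {"file": None, "line": None,
             "assertions": None, "first_subgoal": None, "chained_facts": []}
    for row in block.splitlines():
        file_name, line_no, tag, payload = row.split("\t", 3)
        point["file"], point["line"] = file_name, int(line_no)
        if tag == "Assertions":
            point["assertions"] = payload
        elif tag == "First_Subgoal":
            point["first_subgoal"] = payload
        elif tag == "Chained_Fact":
            point["chained_facts"].append(payload)
    return point

print(parse_datapoint(SAMPLE))
```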

What do you think?

brando90 commented 3 years ago


Hi yutaka,

that looks good for that goal. I do want to keep the hardcoded vector representation that you came up with. The most important thing is that for every node in the proof tree we have its raw formula. So if we go from assm1 g1 to assm2 g2 using Tac, then we want:

(assm1 g1, Tac)

the next set of goals would be a data point for the next tactic (not included in my example). Your example might be too simple to show whether the format is robust, but I think it's ok. What we want to know is which tactic was applied to prove a specific goal in a specific local context.

If the tactic took an argument, we'd also like that to be part of the target, e.g.

(assm1 g1, Tac arg)
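
To make that concrete, here is a minimal Python sketch of the record I have in mind; all field names and the example values are made up for illustration, this is not an existing PSL format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TacticStep:
    """One (proof state, tactic application) pair; names are illustrative."""
    assumptions: List[str]                               # assm1, raw strings
    goal: str                                            # g1, raw string
    tactic: str                                          # Tac
    arguments: List[str] = field(default_factory=list)   # arg, if any

# (assm1 g1, Tac arg) as a concrete, made-up instance:
step = TacticStep(assumptions=["finite A"],
                  goal="card A <= card B",
                  tactic="induct",
                  arguments=["A"])
```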

Does that make sense?

brando90 commented 3 years ago

@yutakang actually perhaps this the best thing:

  1. I think what Tactician suggests is better: "tactic_a1 is saved into the database accompanied by both the proof states before and after tactic execution, in this case ⟨Γ_a1 ⊢ σ1, tactic_a1, Γ_a2 ⊢ σ2⟩" is good.

I still think the arguments to the tactic should be saved.

For future work, I think even just saving the entire proof tree from a proof script would be best. That way ML researchers can extract whatever they want (e.g. parents, steps, anything from the proof tree). Perhaps that's for a different discussion, but I'd be curious how easy this is to do for Isabelle from an engineering perspective.
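
A rough sketch of what I mean by saving the tactic-level proof tree; all names here are hypothetical, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

@dataclass
class ProofTreeNode:
    """A node in the tactic-level proof tree (names hypothetical).

    `state` is the raw proof state as a string; `tactic` is the step
    applied at this node (None for open leaf goals); `children` are
    the proof states that step produced."""
    state: str
    tactic: Optional[str] = None
    children: List["ProofTreeNode"] = field(default_factory=list)

def steps(node: ProofTreeNode) -> Iterator[Tuple[str, str, List[str]]]:
    """Flatten the tree back into (state, tactic, next_states) triples,
    i.e. the per-step records discussed above."""
    if node.tactic is not None:
        yield node.state, node.tactic, [c.state for c in node.children]
    for child in node.children:
        yield from steps(child)
```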

brando90 commented 3 years ago

by proof tree I mean the tree built from the tactics (not the proof tree in the logician's sense)

brando90 commented 3 years ago

this is ultimately the datapoint I am suggesting:

(current_goal, assumptions, tactic arguments, resulting obligations)

brando90 commented 3 years ago

@yutakang

Summary of discussion:

We want to create a data set that is as complete as possible, mainly because it's hard to recover missing information once the data set has been created in an offline setting. I thus suggest we have something like this:

data = (A1==>G1, assms, Tac args args_names, A2==>G2)
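
As a Python sketch, extending the step record above with the before/after states and a line-based serialization that ML pipelines can stream; the field names and the JSONL choice are mine, not an agreed format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple

@dataclass
class Record:
    """data = (A1==>G1, assms, Tac args args_names, A2==>G2);
    field names are mine, payloads are raw Isabelle strings."""
    goal_before: str                                            # A1 ==> G1
    assms: List[str]                                            # chained facts
    tactic: str                                                 # Tac
    args: List[Tuple[str, str]] = field(default_factory=list)   # (name, arg)
    goals_after: List[str] = field(default_factory=list)        # A2 ==> G2

def to_jsonl(records: List[Record]) -> str:
    """Serialize one JSON object per line so the data set can be streamed."""
    return "\n".join(json.dumps(asdict(r)) for r in records)
```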

does this sound good? Is there anything missing?

brando90 commented 3 years ago

@yutakang what is the difference between A1 and assms?

brando90 commented 3 years ago

hi @yutakang! How are you? How is the data set we discussed going?

Regards, Brando

yutakang commented 3 years ago

Hi @brando90 ,

Sorry, my only brother was dying, which was causing lots of problems. To handle the situation I limited my time for research to the lowest possible level, which was just enough not to be fired... at least so far.

I am back in Singapore and resuming Isabelle-related work in my spare time. Sorry, it might already be too late for you and your folks by now.

By the way, the notifications from GitHub sometimes don't work for me. In case you don't receive a reply from me, please send an email to yutaka@yale-nus.edu.sg or tweet at @YutakangE.

Best regards, Yutaka

brando90 commented 3 years ago


No worries Yutaka, family is really important. Recently my father also had a terrible complication, likely as severe as yours.

It's not too late to create the data set we were chatting about, but it might take me longer to implement agents to benchmark it or to get a team to try that. Let me know if you do have time for it.

Regards, Brando

brando90 commented 3 years ago

btw, I recently made my repo public in case someone wants to help us out: https://github.com/brando90/isabelle-gym/issues/30

yutakang commented 3 years ago

Hi @brando90 ,

I am sorry about your father's situation.

I have recovered a little and started working on this repository in my spare time. Josef at CTU in Prague told me that I can use their machines to extract data.

So, probably I can extract the dataset again.

I check this repository every once in a while, but if you don't receive a reply from me, please send an email to yutaka@yale-nus.edu.sg or tweet at @YutakangE.

brando90 commented 3 years ago

Sounds good!

Feel free to message me when your data set is ready. No pressure!

Hope you and your family are doing better; you should make that the priority. This can wait.

Regards, Brando

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/data61/PSL/issues/178#issuecomment-882086515, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOE6LXZLGNKMR2BRBUWZE3TYMBXDANCNFSM4REIYK5Q.