Zistack / program-modeling-language

A project with the goal of designing and implementing a program modelling language that allows for the efficient evaluation of arbitrary static assertions about the program's behavior.
0 stars 0 forks source link

What questions can be asked, and answered. #3

Open JacquesCarette opened 3 years ago

JacquesCarette commented 3 years ago

Issue #2 has established that the purpose is a new PL that is analyzable.

But that is much too broad. To be more concrete, one needs to figure out a starting set of "answerable questions". The menu is huge - there are tons of type systems, abstract domains (related to abstract interpretations), contract languages, or systems with refinement types that have some claim to already doing this. So being precise about the properties that are in scope is quite important.

Zistack commented 3 years ago

OK, so the goal is something I am calling 'representational transparency'. I am not sure how to set up a formal definition just yet, but the basic idea is as follows. We have a programming language, and we're using it to model some domain. This domain model is an embedded language. The idea is that the programming language does not significantly add cost to analyzing the embedded language. Most specifically, at most a polynomial factor is added to the cost of analyzing the embedded language. There may be ways of being more specific about the cost that a language adds to reasoning that might be more useful than this definition, and that can be considered later.

As far as what set of answerable questions I am referring to when I say 'analyze', the short answer is 'yes'. I don't know if you are familiar with the literature on knowledge compilation, but what I'm shooting for is isomorphic to their goals. They studied the cost of using various logics for the purpose of building knowledge databases. They were typically concerned with the cost of performing a pretty consistent set of queries - the most powerful of which was about logical entailment. In other words, the ultimate goal was to take a logic that was expressive enough to actually be useful, while also supporting a logical entailment query (is statement A entailed by statement B?) in polynomial time. I don't think that they ever succeeded, but I did learn a lot about what not to do by studying their failures.

I don't know what the name for the analog of entailment is here, but I am fundamentally attempting to do what they were attempting to do, just on the other side of the Curry-Howard correspondence. The notion of asking about logical entailment basically translates into asking whether or not program A is a specialized version of program B. An example of two programs with this relationship would be a template program, and an instantiation of said template program. Dynamic polymorphism can be used to construct programs with such a relationship as well. Of course equivalent programs satisfy the relationship bidirectionally, though in a system where entailment is affordable, there are usually simpler algorithms for checking equivalence available that are just as efficient.

With this efficient analog to entailment in hand, you can do everything that all those other systems you listed can do and more.

JacquesCarette commented 3 years ago

I don't know about 'knowledge compilation' at all! Can you point me to an introduction? I do know about description logics, perhaps that is related?

Are you mainly interested in this 'specialization' question? It's still very hard, as decent engines will do a lot of simplification while instantiating. My various papers on partial evaluation being just one example.

Equivalence is a fairly different relation than that of specialization.

I'm not entirely sure of the relation between "domain model as an embedded language" and these analysis questions. Can you give me a very concrete example?

Zistack commented 3 years ago

I don't know about 'knowledge compilation' at all! Can you point me to an introduction?

This is a pretty good overview, with decent references: https://jair.org/index.php/jair/article/view/10311/24622

Note, I don't really consider myself an expert in the domain (though I have read a few of the referenced papers, and have at least heard of almost every approach that they discussed, along with some that they didn't... :thinking:). I mostly was reading about some of their tricks as I was exploring possible ways I might try to deal with indexing and memory during attempt 1.

To clarify some things that I said earlier, I'm talking about establishing a language for which the analog of sentential entailment has certain performance properties, though I am not shooting for hard polynomial bounds in all cases, as I would be surprised if that was actually attainable.

Equivalence is a fairly different relation than that of specialization.

Yes, but two programs are equivalent iff they are both specialized versions of each other, so this specialization query can easily be used to construct equivalence queries. The paper I linked makes the analogous remark.

I'm not entirely sure of the relation between "domain model as an embedded language" and these analysis questions. Can you give me a very concrete example?

The domain that you will most likely run into first when writing any serious software is going to be arithmetic. However you decide to implement arithmetic, that implementation is essentially the definition of an embedded language. In slightly different terms, it's just the definition of a language, using this programming language as the meta-language.

The reason that I talk about embedded languages is because I highly doubt that I will be able to get hard polynomial-time bounds on the costs of the queries that I am interested in if I consider all possible programs. There will always be ways to confuse the analysis engine and make it work extra hard.

The underlying assumption that this work makes is that at least some of the complexity that exists in analyzing programs right now is due to properties of the programming language itself obscuring the properties of the embedded languages from the analysis tool, causing it to have to work extra hard to either extract those properties, or simply work around the fact that it can't see them. I am trying to design a language which does not obscure any parts of the embedded languages that we choose to represent within it. Thus the name of the desired property: representational transparency.

Besides that assumption, I also think that there is a way of extending my definition of a language and my definitions of the constraints and of the desired property (representational transparency) so that we can think about it for embedded languages as well. In principle, if all languages defined in a program are representationally transparent, then analyzing the program will be doable in poly-time. I have actually given some thought as to what this might look like for arithmetic, but I haven't spent loads of time on it since I haven't really finished my foundation yet.

Representational transparency does not assume that the embedded language is cheap to reason about. Arithmetic, as a general rule, is not cheap to reason about (but I don't think it has to be this way). As such, this work will not immediately make every program cheap to reason about. We will have to go through all of the languages that we use and try to make them all as cheap as possible to reason about in order to get full benefits. I think that most, if not all of the mathematical languages that we have developed can likely be redesigned in such a way as to be representationally transparent.

Don't get confused about my objective here, though. I want to get this language done first. I consider the generalization of properties and the construction of representationally transparent languages for other domains to be a whole other project for after this one is basically done. 'Future work', if you will.

Zistack commented 3 years ago

I'm working on a write-up of the motivations of this project. In this write-up, I have to define my entailment analog query to show how it relates to a bunch of other queries that people would recognize (both things we already do, and things we would like to do).

Because programs typically refer to a whole set of memory locations with a single variable name, I need to bring over semantic entailment as it is typically defined over predicate calculus. I was wondering if there was some sort of established terminology for talking about the different 'instances' or 'memory locations' of a variable that I could lean on that wouldn't be too specific to any particular interpretation of how computation is performed (For example: I don't really want to use the notion of a memory location because it would break down when variables could formally have highly-structured/non-trivial values). Know of anything?

JacquesCarette commented 3 years ago

At some level, the question doesn't make any sense: in languages like ML, a let-bound variable is a name for a value, and may never be stored in memory (heap) per se. It'll it a register, maybe.

In imperative languages, there is a conflation of the idea of a name for a value, and a label for a memory location. This is a very bad idea.

Lastly, you seem to indicate that this is not a static relationship. Now, most of the time, a single label will be associated to a single memory location, and what can change is what's stored there. Your language makes me think that you're seeing variables as even more dynamic than that. Are you somehow implicitly referring to pointer-valued-variables?

First and foremost, variables are always just names for other things. Whether it is a name for something simple or something complicated, should not matter at all.

Well, it does matter if you're using a meta-variable to quantify over shapes of values. If you're using a meta-variable to be the name for a whole tree or a whole linked-list, then yes, that does get more complex. Is that what you're talking about?

Zistack commented 3 years ago

I should clarify: Forget about any particular language for the purposes of this question here, including mine. I'm just asking if there is any common sort of terminology to refer to the sorts of relationships variables (as seen in the program text) have with values (as seen during a program's execution - in an abstract sense, not necessarily as seen by hardware).

I can't really define semantic entailment over programs without being able to talk about which values/variables matter and which ones don't. Logical statements in simpler calculi can always be written such that only the interesting bits have names, so logical entailment can have a nice simple definition in this case. Programs cannot always work this way, and so need a better treatment of what is actually important for the purposes of entailment.

Now, most of the time, a single label will be associated to a single memory location, and what can change is what's stored there.

Consider a program in C++ that describes a linked list. There will be member variables that have a single name, but refer to many memory locations because there are so many 'instances' of that variable. I think for most real programs, this is the situation for most variables. A variable in a loop or a recursive function (or even just a function called multiple times) would have a similar relationship with its values. I don't consider such a variable to have one name for just one value.

I suppose that restricting the scope of our consideration would largely eliminate this ambiguity, though, so maybe I don't have to worry about this particular bit so much in my treatment (though it does still come up when talking about declarative languages, in general).

Well, it does matter if you're using a meta-variable to quantify over shapes of values. If you're using a meta-variable to be the name for a whole tree or a whole linked-list, then yes, that does get more complex. Is that what you're talking about?

Unfortunately, yes.

JacquesCarette commented 3 years ago

I'm just asking if there is any common sort of terminology to refer to the sorts of relationships variables (as seen in the program text) have with values (as seen during a program's execution - in an abstract sense, not necessarily as seen by hardware).

And I answered that part, for the usual notion of 'variable' and 'value': a variable is a name / label for a value. That's it.

I can't really define semantic entailment over programs without being able to talk about which values/variables matter and which ones don't.

Separation logic? That's basically the whole reason that it was created, to talk about "the ones that don't matter" in a nice way. In normal logics, you basically have to enumerate all the ones that don't matter and explicitly state that they don't, which is horrible. In separation logic, you talk about the ones that matter "by name" and explicitly say that "the rest" don't matter as a frame rule.

Consider a program in C++ that describes a linked list. There will be member variables that have a single name, but refer to many memory locations because there are so many 'instances' of that variable.

This feels like a logical error. A given member variable will refer to a single memory location. An abstract member variable will refer to an abstract, but still single, memory location. It feels like you're quantifying over something that you shouldn't.

A variable in a loop or a recursive function (or even just a function called multiple times) would have a similar relationship with its values.

Same thing. At any point in time, a variable has a single value. Because you don't know that value, all you can say is that it ranges over a particular set (or type, depending on your formalism), possibly described with constraints.

Are you looking for the concept that a variables 'ranges' over a set of values? The $$\in$$ of set theory and $$:$$ of type theory?

You also need to be precise about whether you're talking about variables (ranging over values) or meta-variables (ranging over expressions /terms that evaluate to a value). You further need to be precise about what your values are -- a data-structure (like a linked list) is a very different kind of 'value' than a character. In some languages, a linked list is not a value (in C, for example) at all. In Java, it can be.

Zistack commented 3 years ago

I'm not sure how to address that response directly. It is clear that there is a fundamental difference in perspective going on here that is difficult to articulate. It would appear that terminology such as what I am asking for does not currently exist. Let me restate my goals in a way that tries to eliminate room for hidden assumptions.

So, I want to define a program entailment operator that is language-agnostic. The idea is that you could look at the definition and see how it would apply to any programming language. The way that I understand this operator is language-agnostic, so it seems like it should be presentable in that way as well.

It does not make sense to talk about variables having different values at different times in this context. I am basically assuming something like a SSA perspective on any given program (assuming that the language itself does not already have this property/form).

In order to specify this operator, I have to be able to talk about assignments of values to some subset of variables, and I have to be able to talk about the 'consistency' of that assignment with respect to the structure of the program. I get entailment when I can say something to the effect of 'program A entails program B with respect to some set of variables when all consistent assignments of values to that set of variables in program A are also consistent assignments of values to those variables in program B'.

There should be restrictions on the set of variables that may be specified such that you can't choose to consider a variable that is assigned to multiple times inside of a loop (which would give our comparison interface potentially incomparable shapes). I need to be able to make a distinction between such variables and those variables which really only correspond with a single value per execution in order to express that constraint.

JacquesCarette commented 3 years ago

So, I want to define a program entailment operator that is language-agnostic.

First question (seriously!): why?

For some entailment operator to be language-agnostic means that the only things it can see are outside the program. Some would say just I/O, others could include the heap (i.e. memory) in that as well. Others would allow timing to be observable. But certainly there is going to be no such thing as 'data-structure' as that is very much language-specific.

There are such things already: for example, institutions (a bad name, as it's hard to google). See some high-level ideas at https://courses.engr.illinois.edu/cs522/sp2016/GeneralLogics.pdf . Institutions have been quite popular with some theoreticians, but never made a dent in practice. They are too general to be useful. Their lack of specificity drives people away.

Now you talk about SSA: that brings in a lot of assumptions about your host language. Sure, it won't care if it's C or Pascal -- but it sure won't be Prolog or even Haskell.

Quite a few people use Idealized Algol as a generic PL. It has core features in common with many PLs, without any of the weird quirks of all of them. Is that perhaps what you're seeking? Not necessarily Idealized Algol per se, but maybe a "generic imperative language" that is defined more semantically than syntactically?

I still don't understand:

I have to be able to talk about assignments of values to some subset of variables, and I have to be able to talk about the 'consistency' of that assignment with respect to the structure of the program

Note that I'm not saying you're wrong or you don't want that. What I'm saying is that I see no compelling justification for wanting to do that in what you've written. A gap in your exposition, if you will.

Roughly: I have no idea why you're talking about "sets of variables". Plus, to me, that clashes quite a bit with being language-agnostic (where do these variables come from???). In other words, I continue to be very confused.

Zistack commented 3 years ago

First question (seriously!): why?

Because we can't talk about such an operation being cheap in some languages and expensive in others without being able to define it such that it applies to more than one language. Because this project is as much about how to design languages as it is about actually building a usable tool. We cannot compare the designs of different languages unless we can talk about them in terms of some common language.

For some entailment operator to be language-agnostic means that the only things it can see are outside the program.

If this is really, fundamentally true, then we're completely screwed on the communication front. It means that we do not and cannot have a language which allows us to compare languages. I don't think that this is necessarily true, but I do think that we do not have a language with which to talk about languages at the moment, and I'm entirely not sure what to do about it.

For some entailment operator to be language-agnostic means that the only things it can see are outside the program.

Annoyingly, yes, even though I think of SSA as a concept moreso than a particular form (and tend to use it that way, which might really be more confusing than helpful). I don't know if there is a proper name for the idea that I actually think of as being important here (the basic idea that each variable gets a single value per execution).

Quite a few people use Idealized Algol as a generic PL. It has core features in common with many PLs, without any of the weird quirks of all of them. Is that perhaps what you're seeking? Not necessarily Idealized Algol per se, but maybe a "generic imperative language" that is defined more semantically than syntactically?

Wouldn't this then interfere with my ability to then talk about the same operator as applied to a new language that I would design? If I specify an operator in this idealized Algol, then how in the world can I say "And now I want to design a language for which this operator is cheap.", when "this operator" only makes sense in the idealized Algol and not any other language (including the one which I would design)?

Note that I'm not saying you're wrong or you don't want that. What I'm saying is that I see no compelling justification for wanting to do that in what you've written. A gap in your exposition, if you will.

It's just a necessary part of defining the operator. When we talk about 'valid' executions of a program (I know we use such terms when discussing memory models, for example), we're talking about this sort of consistency.

Roughly: I have no idea why you're talking about "sets of variables". Plus, to me, that clashes quite a bit with being language-agnostic (where do these variables come from???). In other words, I continue to be very confused.

All languages have names that refer to values in some capacity, though the names may be entirely implicit in the text (I'm thinking of how some esoteric languages only allow you to manipulate implicit registers, often using them to address memory). These are variables. When we talk about one program entailing another, we have to say which variables are important. If we say that literally every intermediate value is important, then the only comparable programs are identical ones. Logical entailment makes assumptions about only the named values being important. All of the intermediate values that would occur when you evaluated a logical expression with respect to some assignment of values to variables are not considered when discussing entailment. We need to be able to do something similar for programs, except that in programs, we can't always write the program so that only the important variables get named (and for readability purposes, we probably wouldn't want to anyway).

Zistack commented 3 years ago

There are such things already: for example, institutions (a bad name, as it's hard to google). See some high-level ideas at https://courses.engr.illinois.edu/cs522/sp2016/GeneralLogics.pdf . Institutions have been quite popular with some theoreticians, but never made a dent in practice. They are too general to be useful. Their lack of specificity drives people away.

I've read this now. It's a very interesting paper, but I see why the formalisms presented are not very useful. While I could technically translate this into the domain of programs in a more general way, the notion of entailment defined that way would not be enlightening.

Something about the internal structure of the programs in any given language needs to be said in order to say anything useful in our case, I think. Doing this in such a way as to avoid favoring one sort of programming language over another is the tricky part.