Lonami / pyndustric

A Python compiler for Mindustry's logic processors assembly language
MIT License

Supporting vectorized operations #6

Open TelosTelos opened 3 years ago

TelosTelos commented 3 years ago

Mindustry is a 2D game, and much Mindustry logic involves manipulating 2D x,y vectors, where the same thing is done with both the x and y components. Python coders familiar with NumPy will find such redundant coding tedious and potentially error-prone, and would prefer vectorized operations, which combine these parallel operations together. E.g., if you want the midpoint between a unit and a target, it would be nice to be able to use something like (unit.pos + target.pos) / 2, rather than needing to compute .x and .y components separately. (Note: related to issue #2, it'd also be much better to be able to put this all on one line, rather than on multiple lines.)

In pure Python (like NumPy), such operations are quite easy to support by overloading the relevant operators, like + and *. Things will be quite a bit trickier when compiling to mlog, though, as we won't be able to count on Python to automatically keep track of which variables are supposed to use overloaded operations and which ones aren't. Still, I think it should be relatively straightforward to have the compiler track which variables are meant to hold 2D vectors, and to compile them into paired lines of code (one for each of x and y). To a first approximation, this would require maintaining a set listing such variables, adding things to this set when they are created by a method that should produce a vector (like retrieving an object's .pos, or using a vectorized operation that outputs another vector), and then compiling later uses of such variables into appropriate de-vectorized pairs.
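
To sketch the idea (the helper names here are just illustrative, not anything pyndustric actually defines):

vector_vars = set()                       # names we have decided are 2D vectors

def emit_binop(emit, op, dest, a, b):
    """emit() appends one mlog instruction string to the output program."""
    if a in vector_vars or b in vector_vars:
        ax, ay = (f"{a}_x", f"{a}_y") if a in vector_vars else (a, a)
        bx, by = (f"{b}_x", f"{b}_y") if b in vector_vars else (b, b)
        emit(f"op {op} {dest}_x {ax} {bx}")   # one instruction for the x components
        emit(f"op {op} {dest}_y {ay} {by}")   # ...and a parallel one for the y components
        vector_vars.add(dest)                 # a vectorized op produces another vector
    else:
        emit(f"op {op} {dest} {a} {b}")       # plain scalar case: a single instruction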

Relatedly, it may make sense to allow units and blocks themselves to participate in such vectorized operations. So, e.g., dst(unit - player) would be equivalent to dst(unit.pos - player.pos), which would itself be equivalent to dst(unit.x - player.x, unit.y - player.y), which of course would be equivalent to even more long-winded things using sensors to retrieve these attributes, and/or breaking this all apart into a long sequence of single-operation lines.

Lonami commented 3 years ago

I like the idea of supporting vectors out of the box, but I don't think units or containers should behave as positions. A unit is not a position, neither is a container.

To make it clear that accessing this position is not zero cost (requires a sensor "call"), we might want to do something like unit.pos().

Although if operators like + start emitting more than one instruction, unit.pos isn't so bad.

TelosTelos commented 3 years ago

Yeah, I think that mlog is such low-level, assembly-like code that there shouldn't be any expectation of 1:1 correspondence between Python code and underlying instruction count. The goal should be to have Python expressions match up with intuitive ways of thinking about the stuff, which often may (and should!) require numerous low-level instructions to implement.

As for requiring .pos, I think it's pretty intuitive to ask for the distance between two units, or to order one unit to approach another, rather than pedantically insisting that the coder ask for the distance between their positions, and to command the one to approach the other's position.

TelosTelos commented 3 years ago

So, it looks to me like there are two main hurdles on the path to vectorized operations.

(1) First, we need to detect which values need to be treated as vectors and which need to be treated as scalars. E.g. when we encounter the expression a+b, are these two scalars, so the result is the scalar a+b? Is just a a vector, so the result is the vector [a.x+b, a.y+b]? Is just b a vector, so vice versa? Or are both vectors, so the result is [a.x+b.x, a.y+b.y]?

The two main options here are (1A) to use explicit cues from the coder, like using a.xy in places where they want a to be treated as a vector, or (1B) automatically detecting what is a vector. Auto-detection will probably be easier if we assume "once a vector, always a vector" for each variable. Auto-detection could conceivably work in both directions: e.g., some things are known to be vectors (explicit vectors like [0,0], special at-variables like Unit and This, globals like vault1 that are referred to before being assigned to (though distinguishing these within functions may be tricky), and outputs of operations that are known to produce vectors, like radar and ulocate), so anything produced by a vector-producing op on those things will therefore also be a vector; and some operations like Unit.move_to are known to require vector arguments, so we could in theory work backwards from those uses to determine that the previous ops that produced those args must also have produced vector outputs (though it may be hard to distinguish which of their inputs were vectors, e.g. because a+b can produce a vector output with a being scalar, or with b being scalar). I'm inclined to think backwards detection like this is more trouble than it's worth, since it would definitely require multiple passes. The hardest cases for auto-detection will probably be user-defined functions, since they offer less in the way of preceding uses of the arguments, esp. when first compiled. So this may be a case where some form of explicit cue is practically required.

(2) The second hurdle is figuring out when to ask the game for .xy coordinates of an object (typically using a pair of sensor instructions) and when to trust that we already have up-to-date coordinates stored in relevant variables (e.g. a_x and a_y), so we don't need to waste instructions reloading these values. Again the two main options are (2A) to use explicit cues from the coder, or (2B) to implicitly detect when to update .xy info for a vector. For auto-detection: ulocate commands output up-to-date .xy info "for free", so that should always be used, and some @variables like @this apparently have .xy info available "for free" as @thisx and @thisy, so those should always be used too. For other known-to-be-vector objects, their first participation in a load context for a vectorized op requires an explicit sensor lookup. Later references to such objects are trickier, since variables storing .x and .y info will exist for them, but this info may be out of date, e.g. if time has passed or the variable has been reassigned to refer to a new object. Proper auto-detection of this might require serious flow-of-execution tracing, to determine what all may have happened since the last time we were assured that the local .xy info for a vector variable was up to date, but I don't think we're in a position to do that. So we're stuck with cruder heuristic-like auto-detection and/or explicit user cues.
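
A crude sketch of what that heuristic might look like (purely illustrative, not actual compiler code):

def xy_operands(emit, var, fresh):
    """fresh: the set of vector variables whose _x/_y values are known to be current."""
    if var == "This":
        return "@thisx", "@thisy"            # always current, no sensor needed
    if var not in fresh:                     # first use (or known stale): read the sensors
        emit(f"sensor {var}_x {var} @x")
        emit(f"sensor {var}_y {var} @y")
        fresh.add(var)
    return f"{var}_x", f"{var}_y"            # later uses just reuse the cached values

Anything that reassigns the variable (or lets too much time pass) would have to knock it back out of the fresh set, which is exactly the flow-tracing problem above.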

TelosTelos commented 3 years ago

It seems likely that we won't be able to do everything with (B) autodetection, so we'll need to (A) allow at least some explicit cues, both (1A) that an expression is a vector, and (2A) that we need to look up current values for it. So I'm tempted to implement this just with explicit cuing first, as a sort of proof-of-concept and useful working model, and then maybe add whatever layers of autodetection we feasibly can later.

So that faces us with the question of what explicit cueing to use. I'm currently inclined (1A) to compile .xy as though it is an attribute that defines overloaded vectorized operations. If we're using only explicit cueing, that would mean that any vectorized operation will need to be formatted along the following lines: a.xy = c * b.xy + [0,1] (where c is clearly scalar here due to having no .xy, and two-element lists are always construed as vectors).

I'm currently inclined to use a.updated.xy as the explicit cue to use sensor commands to provide up-to-date .xy info for object a. This can behave in larger expressions much as a.xy would.

Under the hood, I think I'll make as_value and as_target return a pair of values (e.g., [a_x, a_y] or [__temp22_x, __temp22_y]) when they know they're dealing with a vector, and then I'll need to rewrite the various uses of these to check whether any operands are lists like this, and if so "broadcast" the non-lists to match them in a vectorized operation.
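
For example, the broadcasting step could look roughly like this (names illustrative):

def broadcast(value):
    """Turn a scalar into an (x, y) pair so it lines up with vector operands."""
    return tuple(value) if isinstance(value, list) else (value, value)

def emit_op(emit, op, target, left, right):
    if any(isinstance(v, list) for v in (target, left, right)):
        (tx, ty), (lx, ly), (rx, ry) = broadcast(target), broadcast(left), broadcast(right)
        emit(f"op {op} {tx} {lx} {rx}")
        emit(f"op {op} {ty} {ly} {ry}")
    else:
        emit(f"op {op} {target} {left} {right}")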

Lonami commented 3 years ago

I've lost track of why .xy is trickier. Isn't it trivial? .xy always gets expanded to two operations with _x and _y (or whatever suffix/prefix), and .x or .y get expanded to one, with their respective variables.

We treat vectors as immutable, much like single values, meaning they can't be changed in-place, and every change to a vector produces a new value. Thus this:

a.xy = vec(1, 2)
b.xy = a
a.x = 3
print(b.xy)

…shows (1, 2).

Basically .xy enables users to treat anything like a vector. The downside is they need to type this every time they want vector-like access.

Instead, the option of tracking types should be fine, but I think we should raise an error when the same variable is used as two different types if it occurs in any condition or loop (because the type "may or may not change" and we can't know).

Supporting the jumps from Python you suggest also makes type tracking at compile time impossible.

TelosTelos commented 3 years ago

It sounds like we're leaning in the same direction for now, of including lots of .xy's in vectorized operations.

I guess I've used NumPy (and other operator-overloaded NumPy-like setups) enough that I'd rather not have to include all the .xy's and would rather refer to vector-like objects simply by name in mathematical operations. But those systems work because Python has a full-fledged class system, whereas mlog has only whatever approximation of this we manage to implement ourselves, and it may well be that the best we're willing to do requires explicit .xy's all over the place.

Good point about flow-tracing becoming even trickier when arbitrary jump-destinations are allowed. Glad I'm not planning to do flow-tracing!

I agree that, if we start doing much automated detection, imposing a "once a vector, always a vector" rule is probably a good idea, including enforcing it with compiler errors. This helps with (task 1) determining what is a vector, but unfortunately doesn't help much with (task 2) determining when a vector needs to be updated from sensors.

I don't think "immutable" is quite what you mean here. E.g., if a.xy were an immutable named-tuple, your a.x = 3 would be disallowed, or at least would have to be an idiom for something that replaces a with a new immutable tuple. I'd personally think of these as mutable (e.g., it's fine to change just a.x while leaving the rest of a unchanged). I'm torn two ways regarding whether b.xy = a should always be equivalent to b.x, b.y = a.x, a.y as I think you're suggesting, or whether it should instead be compiled like b = a in cases where a refers to a mindustry unit or building (the only "object-like" things mlog has access to), so b would end up being another pointer to that very same object. On the one hand, I'm inclined to read b.xy = a to mean b = a  # type hint: b is vector-like, which could reasonably yield the second, two-pointers-to-the-same-object behavior; on the other, I'm also inclined to parse it as a macro for b_x, b_y = a_x, a_y, which would yield the first, copying, behavior. Hrmm....

Lonami commented 3 years ago

Python has the luxury of doing all those operations on different types because the checks occur at runtime, but we need to do it at compile time.

I think tracking the types would be worthwhile to save us from all the .xy, which can quickly become noise, and flow analysis can be kept simple (if any branch changes the type, it's an error).

With regards to immutability, I mean the following. When you do x = 7, you cannot change that value. x is a reference to the int object 7. You can only update the reference x points to.

Similarly, changing vec.x would cause vec to point to (reference) a new vector (a copy of the original, with a new x value). Essentially, the original vec reference is still valid, and has not mutated, which means any other variables referencing that same old vec don't see the changed x.

Thus the value is immutable, meaning we don't need to track and update all references, and the snippet I described in https://github.com/Lonami/pyndustric/issues/6#issuecomment-760705648 is valid. If vectors were mutable, it should've printed (3, 2), and that is when it becomes a lot harder. But they're immutable and we make copies.
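
In other words, b.xy = a would compile to component copies, roughly like this sketch:

def emit_vector_copy(emit, dest, src):
    emit(f"set {dest}_x {src}_x")   # copy components, not the reference,
    emit(f"set {dest}_y {src}_y")   # so a later write to a_x leaves b untouched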

TelosTelos commented 3 years ago

Stepping back from this, auto-detection is quite tricky if we continue doing everything in one pass, esp. if we continue to compile function definitions prior to seeing what will call them, so we won't yet have any clue what the arguments will be. E.g., suppose we're compiling the following function:

def halve(v): return v/2 

If you look just at this function, there's no way of telling whether the argument will be a vector or a scalar, so no way to tell whether this should compile to a single division op or a pair of division ops. The only ways to answer that question are (A) to demand that the user somehow explicitly indicate which this is, e.g., by making the body v.xy / 2, or (B) to examine the arguments that will be passed to this function before finalizing our compilation for this function.

Another wrinkle is that for macro-like inline functions, it could be perfectly sensible to make some inline instances vectorized and others not, depending upon each instance's arguments. E.g., halve( scalar ) might compile to shorter code than halve( vector ) would. If we're going to allow this, then we'd have to hold off on finalizing which ops within the function will be vectorized.
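
For example, an inlined halve() could pick its emission per call site, along these (hypothetical) lines:

def expand_inline_halve(emit, arg, vector_vars, ret="__halve_ret"):
    if arg in vector_vars:                    # halve(vector) -> two divisions
        emit(f"op div {ret}_x {arg}_x 2")
        emit(f"op div {ret}_y {arg}_y 2")
        vector_vars.add(ret)                  # and the result is itself a vector
    else:                                     # halve(scalar) -> one division
        emit(f"op div {ret} {arg} 2")
    return ret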

Part of me is suspecting that there are two quite different approaches to this that we'll need to choose between: (A) require that every use of a vectorized op somehow be unambiguous on the first pass, often due to explicit type hints, esp. within function defs, so that we can correctly choose vectorized or unvectorized compilation on first pass, or (B) treat vectorized ops much as we now treat line numbers, and function bodies, by initially compiling them into some precursor that will be transformed into completed code in some later pass.

Toying with an idea here: let's imagine that during the first pass, we compile all vectorized ops into their scalar version, so e.g. u.xy + v.xy compiles in the first pass just like u + v. But we also maintain a set of all variables (and of every non-inline function argument -- inline args evaporate when inlined) that we have become convinced are supposed to be vectors, including ones that are explicitly cast as vectors, ones that are outputted by ops that output vectors, and ones that are used in ways that duck-type them as vectors. This set will effectively be a closure under various relations, so it may take several iterations/passes to grow out a stable assessment of which variables are vectors and which aren't. Once we reach such a stable assessment, we go back through and replace every vectorizable op that has at least one vector argument with a pair of ops, just like the original except with _x or _y appended to all the vector variables. (This helps a lot with the first hurdle above, determining which variables and ops need to be vectorized, but not with the second hurdle, determining when to use sensors to initialize/refresh the _x and _y values of a vector. This also would need some finessing to handle vector constants like [0,100].)
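
A rough sketch of that iteration, assuming the first pass records each assignment's destination and operands:

KNOWN_VECTOR_SOURCES = {"Unit", "This"}       # plus [x, y] literals, radar/ulocate outputs, ...

def infer_vector_vars(assignments):
    """assignments: (dest, operands) pairs gathered during the first pass."""
    vectors = set(KNOWN_VECTOR_SOURCES)
    changed = True
    while changed:                            # monotonic and finite, so this terminates
        changed = False
        for dest, operands in assignments:
            if dest not in vectors and any(o in vectors for o in operands):
                vectors.add(dest)             # a vectorizable op with a vector input
                changed = True                # produces another vector
    return vectors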

TelosTelos commented 3 years ago

One wrinkle for automatic type-checking is values read from memory cells, which can be scalar or can be references to vector-like units/buildings. Other than this, I think it would be fairly straightforward to infer types from the operations that assign values to variables, assuming/enforcing that the same variable must have the same type in all occurrences, and that never-assigned variables must have been vector-like blocks like vault1. If we also type-track the difference between abstract/pure vectors and vector-like units/buildings, we can also get a fairly decent solution to the "second hurdle": make all references to units/buildings read .xy values from sensors, and leave the coder responsible for storing these in another variable if they want to save instructions by re-using recent sensor readings.

Still, I think automated type-tracking may be a bigger enterprise than I'm willing to take on right now (perhaps ever), so I'm leaning towards a useful enough, even if somewhat cluttered, approach where every element that will be vectorized should be explicitly marked as such, with .xy or brackets [ ] listing two components. I'll still be stuck doing a bit of automatic type-selection for temp-variables in complex expressions, but hopefully that will be manageable, since almost all operations with vector inputs yield vector outputs (with the exception of dst, atan2, and some user-defined functions).

I'm tempted to have the first pass just create single commands with lists in place of vector elements: e.g., a.xy = b.xy + 1 initially compiles to op add [a_x, a_y] [b_x, b_y] 1 and then have a later pass go through and split such instructions into pairs. If we someday decide we want to add more automatic type-checking, this will leave us a convenient window to vectorize some more operands before the pairs get split and harder to deal with.
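
The later splitting pass could then be as simple as this sketch:

def split_vectorized(instructions):
    out = []
    for parts in instructions:                # e.g. ("op", "add", ["a_x", "a_y"], ["b_x", "b_y"], 1)
        if any(isinstance(p, list) for p in parts):
            for i in (0, 1):                  # emit one copy for x and one for y
                out.append(" ".join(str(p[i]) if isinstance(p, list) else str(p)
                                    for p in parts))
        else:
            out.append(" ".join(str(p) for p in parts))
    return out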

Lonami commented 3 years ago

IIRC you can only store numbers in memory cells, not strings, not references (it would be good if you confirmed this).

TelosTelos commented 3 years ago

Ah you're right. I thought I had done this, but what I'd actually done was store the @x and @y of a unit, rather than the unit itself. (I was trying to divide labor between a "watcher" processor whose job was to keep finding the player, and "follower" processors whose job was to keep units from straying too far from the player, so the watcher needed to signal the player location to the followers via memory block.)

Anyway, that makes things a bit easier for type tracking. In fact, so long as we can assume (a) that each variable always has the same type, (b) that special @variables and constants are of known types, (c) that all other never-assigned variables are of type Block, (d) that out-of-line function arguments are always the type of any value passed to the argument, and (e) that the type of each instruction's output is fully determined by the type(s) of its inputs, then there will be no ambiguity at all about types, though discovering them may take repeated passes through the code. (E.g., in a convoluted program, it might not emerge until near the end of the first pass that a is a vector; it might not emerge until late in the second pass that b is therefore also a vector; and so on..., but this process is monotonic and finite, so it's guaranteed to come to rest eventually, and in most programs it'd only take a couple passes. Alternatively, you could probably build a graph of type-entailment relations in a single pass, and then it'd be much more efficient to derive types for everything via it, but probably more work than it's worth.)
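
For what it's worth, the graph alternative would just be a worklist propagation, roughly:

def propagate(entails, known_vectors):
    """entails: dict mapping a variable to the variables whose vector-ness follows from it."""
    vectors, pending = set(known_vectors), list(known_vectors)
    while pending:
        x = pending.pop()
        for y in entails.get(x, ()):
            if y not in vectors:
                vectors.add(y)
                pending.append(y)
    return vectors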

I'll have to set much of this machinery up regardless, just to derive types for temporary variables in complex expressions. Now that I have a clear picture of how this machinery could be used to determine all types with no need for explicit type-hinting, that's making me lean again in the direction of doing this without any explicit .xy's. (I guess we'll still need to make some decisions about what all gets copied in a = Unit: just the pointer to the unit, just its .xy coordinates, or both? The answer to this may or may not give us a good answer to the second hurdle above, determining when to use sensor instructions to look up the position of a unit.)

(BTW, when I tested this, I found that writing a unit to a memory cell apparently stores 1 there, not very useful. You can print units though, which outputs their type (apparently a string).)

Lonami commented 3 years ago

I like how you keep tagging people with @ghost names like @variables and @counter. Good thing they're not active accounts (or don't seem to be).

TelosTelos commented 3 years ago

Yeah, I noticed that too. Embedding in back-quotes seems to prevent that, but is hard to remember.

TelosTelos commented 3 years ago

Here's the notation I've tentatively settled upon.

For most ordinary purposes, you can refer to any object/vector by name, and it will behave as a vector in vectorized numerical operations (much as NumPy arrays do) and as an object in non-numerical operations. E.g., (Unit + enemy)/2 will calculate the midpoint between Unit and enemy. Two element lists, like [0,10] or [x,y] also participate like vectors in vectorized operations. Classes of variables are automatically detected, but a TypeError will be raised if the same variable switches classes (e.g., being used as a scalar in one place and a vector in another). The compiler somewhat intelligently chooses when to use sensor operations to retrieve up-to-date information about objects' .x and .y coordinates.

The following details will be irrelevant for most purposes, but occasionally could be useful, e.g. to shave off a few instructions.

A "vector" is anything with x and y coordinates, including mindustry blocks, mindustry units, and "pure vectors" which have no in-game analog, but also store an x,y pair. Vectorized operations typically produce pure vectors as their output.

For any Mindustry object (block or unit), object.attr uses the mlog sensor command to retrieve the current value of that @attribute from that object. This works for non-vector attributes like .health and .copper, and similarly for vector attributes .x and .y (with the exception of the special object This, which instead compiles these to the global variables @thisx and @thisy to provide up-to-date coordinates without needing separate sensor instructions). For pure vectors, .x and .y still refer to their x and y coordinates, but these cannot (and hence will not) be updated by sensors.

For any vector (block, unit, or pure vector), vector_x and vector_y refer to the most recently known values for that vector's coordinates. Such underscore notation can be useful in cases where you don't want to waste instruction cycles using sensors to re-read @x and @y from an object whose old readings are close enough to current. This can also be used to efficiently change just the _x attribute of a pure vector, while leaving its _y unchanged.

The suffix xy refers to x and y coordinates together as a pure vector. So, vec.xy uses sensors (if available) to read up-to-date values into vec_x and vec_y, whereas vec_xy just uses the most recent values of these without updating them with sensor readings. Both of these behave in larger expressions as the pure vector [vec_x, vec_y], forcing the larger expression to be vectorized. By default lastUnit = Unit assigns lastUnit to refer to the currently bound Unit, so you could still refer to this unit even after binding a new one. In contrast, lastPos = Unit.xy assigns lastPos to be a pure vector storing the current [Unit.x, Unit.y], and it will retain that value even after this Unit moves and/or a new Unit is bound.

For objects (units/blocks), the suffix id forces non-vectorized operations. This is mostly useful for doing comparisons. By default, unit == 1 is vectorized to unit.x == 1 and unit.y == 1 which checks whether the unit is just in from the lower-left corner of the map. In contrast, unit.id == 1 will be True whenever unit refers to an actual unit (since mlog converts all non-null units to the number 1 for numerical operations), and False when it is instead null.
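
To illustrate, a hypothetical snippet using this notation (assuming it all lands as described) might read:

target = Unit                     # keep a reference to the currently bound unit
anchor = Unit.xy                  # snapshot its coordinates as a pure vector
midpoint = (Unit + vault1) / 2    # vectorized arithmetic mixing a unit and a block
Unit.move_to(midpoint + [0, 10])  # two-element lists act as pure vectors
if target.id == 1:                # non-vectorized check that we still have a unit
    print(dst(Unit - anchor))     # distance moved since the snapshot was taken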

TelosTelos commented 3 years ago

I had been assuming that encountering set a b tells us that a and b are of exactly the same class, so I had been tracking "cohorts" (equivalence classes) of atoms linked by set relations, and forcing them all to end up being the most restrictive class that any of them ends up being known to be.

However, I'm now thinking that this may be too restrictive. E.g. it should be fine to have a midpoint(v1,v2) function that takes different flavors of vectors as inputs, so the fact that its v1 gets set to a PureVector in one call shouldn't preclude this from being set to an ObjectVector in some later call. We could even allow numbers like 0 as inputs, and broadcast them to vectors like [0,0]. (This all works fine for inline already, since it can make different vectorization decisions about different inline instances, but not yet for out-of-line functions.) I.e., we should allow some measure of duck-typing.

The key things that typing is supposed to do for us are (1) letting us know which ops need to be vectorized, and (2) providing some helpful error messages when things won't work. As far as (1) is concerned, for numerical ops, all that matters is vector vs scalar, not which flavor of vector we have. I think there are only two places where the compiler really cares about pure vs object. One is set, which copies components of pure vectors, and just copies pointers for object vectors. The duck-typing approach would allow "purev = objv" and even "purev = number" as perfectly fine vectorized ops, but would view "objv = purev" and "objv = number" as errors. The other place that cares is that the compiler strategically inserts sensor instructions to keep ObjectVectors up-to-date, whereas we don't want such instructions for PureVectors. If we want to allow any flexible variables to sometimes be set to Pure, sometimes to Object, they'll probably need to be classed as Pure themselves, and they won't be able to get updated by sensor readings at all.

Duck-typing makes entailment relations between atoms' classes asymmetric. Encountering a = b tells us different things about types than encountering b = a does. It may be helpful to view set as meaning something like copy_perhaps_with_some_vectorization, which isn't the default behavior of = in Python (though settable Python properties often overload = to behave more like this). Rather than storing equivalence-class cohorts, we could remember input/output pairs. Each atom class can specify the most general class it knows how to get copied from, atom.can_be_copied_from, and the most general class(es) that it can be copied to, atom.can_be_copied_to. The input/output pair a = b tells us that we can type-hint that a must be something b.can_be_copied_to and b must be something a.can_be_copied_from.

Thus far, I've been implementing implicit type hints as specifying the most general class that an atom must belong to. E.g. dst( v ) hints that v is definitely a vector, but not which flavor of vector it is yet. Unfortunately, if we allow duck-typing, set a b isn't quite so neat. E.g., a = 10 tells us that a must be something that Number(10).can_be_copied_to; Numbers can be copied both to other Number-representing atoms and to PureVectors, but they cannot be copied to any more general class containing both of these. So, if we don't know anything about a beforehand, encountering a = 10 won't tell us which branch to go down, whereas if we knew that a was some flavor of vector, the fact that only one of these branches is compatible with that tells us to go all the way down this branch and conclude that a is a PureVector. Messy, but doable I guess.
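
To make that concrete, here's a rough sketch of those copy rules and the narrowing they'd imply (class names are just placeholders):

class Number: pass
class PureVector: pass
class ObjectVector: pass

def copy_ok(target_cls, source_cls):
    """Could 'target = source' compile (possibly as a vectorized component copy)?"""
    if target_cls is PureVector:
        return True                            # purev = number / purev / objv all work
    if target_cls is ObjectVector:
        return source_cls is ObjectVector      # objv = purev and objv = number are errors
    return source_cls is Number                # a plain number only accepts numbers

def narrow_target(possible, source_cls):
    """Given 'a = b', keep only the classes a could still be."""
    return {cls for cls in possible if copy_ok(cls, source_cls)}

# narrow_target({Number, PureVector, ObjectVector}, Number) == {Number, PureVector}
# narrow_target({PureVector, ObjectVector}, Number)         == {PureVector}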

TelosTelos commented 3 years ago

Stepping back a bit, the only proposed exception to the general rule that "set a b means a and b are of the same class" is the case where a is a pure vector, which interprets set a b as being short for a vectorized version that copies values to a_x and a_y separately. So I could just handle this as one special case, and not worry about coming up with some more general solution that would allow other classes to effectively overload = as well.

This special case is easy to handle when I already know a is a pure vector, as I can just type-hint that b is numerical and be finished. It's a bit trickier when a is of unknown type and b is any numerical type (i.e. number or vector). In this case, a = b guarantees that a is either a pure vector or another instance of b's type, whatever that turns out to be, and it'll take a little care to keep track of both possibilities.