cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License
24 stars 1 forks source link

`Emit` Statement Ambiguity #57

Closed susan-garry closed 1 year ago

susan-garry commented 1 year ago

Currently, the emit statement as we've been using it is ambiguous. emit e adds e to a special set whose interface is restricted such that we can safely compute its elements independently (in theory - but that's a separate issue). However, it's not clear what set we are adding e to. If we want to simply output one set of values then this syntax is unambiguous because there is only one possible set that we could emit e to, but if we want to emit multiple sets or output entirely new graphs, things get a little complicated. Each graph consists of multiple sets to begin with, so we want a way to alter, e.g., node1.steps and node2.steps, we need a way to emit to both of these sets. We also want a way to distinguish between emitting, say, a node to an output set vs. emitting a node to a new graph.

I think the easiest way to deal with this ambiguity is to require the user to specify which set they are emitting a value to, such as with emit-to statements: emit e to s. This means users must also be able to initialize special sets like s. Because these aren't ordinary sets and we may want to support ordinary sets further down the line, I think we should probably give these sets a unique name and type, such as ParSet<> or PolSet<> .

There is still the issue of differentiating between when a set is being output directly to the user or modifying an existing graph. I suspect that the easiest way to do this will be to define a specific nomenclature for ParSets that modify an existing graph, e.g. node_output is the set of nodes in our new graph and if a user writes emit e to node_output, we know that we will be emitting a new graph. This nomenclature could be extended to allow a user to output multiple modified graphs by simply indexing these names, so node_output1 and node_output2 would each modify the original graph and produce a new one.

I'm admittedly not a huge fan of defining a special nomenclature since these variable names aren't necessarily intuitive to everyone and it would be better if we could allow users to define their own variable names in as many instances as possible, but the main point of this issue is that emit statements are ambiguous and too restrictive to support everything that we want Pollen to. We could support emit statements in addition to emit-to statements, but I think this just creates further complications and it makes the most sense to me to start with emit-to statements and add emit statements later on if we decide it would be beneficial.

susan-garry commented 1 year ago

On second thought, if we're exposing these special sets to users anyways, might it not be easier to just allow users to write

ParSet<int> s;
s.add(5);
return s;

?

anshumanmohan commented 1 year ago

Thanks Susan!

If this is beginning to look necessary, I'm all for it. I guess I'd just like to state the approach that I was hoping would work (which the above will reject/subsume).

Basically I was hoping that the to of your emit-to could always be inferred from context/typing.

Slipping into pseudocode, slash buggy Pollen, a little...

1.    for node in nodes:
2.    . 
3.        for step in node.steps:
4.            step' = ...
5.            emit step' 
6.        .
7.    .
8.    .

I've added line numbers and blank lines for the spots I want to talk about.

  1. We start to iterate over the input graph's node set.
  2. Nothing for now.
  3. We start to iterate over the present node's step set.
  4. We've constructed something derived from the present set.
  5. Aha! So you're in the node-step-emiting business? That must mean that you're also in the node-emitting business. Like, way back on line 2 you were in the node-emitting business. I will set up an under-construction node' and emit step' into that node's step set. By a similar argument I will also set up an under-construction graph'.
  6. Done with the loop over steps; great.
  7. Done with the loop over nodes? I'll infer that you want to emit node', which will go into the node set of graph'.
  8. I will infer that you want to emit graph'.
anshumanmohan commented 1 year ago

Just to tie this off, looks like you totally have the green light on this!

I'll just write out the reservation @sampsyo had about making such output sets first-class values. We need to make it impossible for them to be parts of ordinary expressions, if-guards, reassignments, etc. Allowing this would open up a whole pandora's box. Consider:

output out_nodes(Node)
output out_nodes_evil(Node)

for node in graph.nodes:
    if node.seq == "GATTACA":
        out_nodes := out_nodes_evil
    out_nodes.add(node)

Clearly this is a mess.

This is not to say that these output sets cannot appear internally. For instance:

output out_nodes_short(Node)
output out_nodes_long(Node)

for node in graph.nodes:
    if len(node.seq) < 10:
        out_nodes_short.add(node)
    else:
        out_nodes_long.add(node)
susan-garry commented 1 year ago

Thank you all for your input! We had a productive discussion at today's meeting where we agreed that emit-to statements are the way to go, specifically because

  1. It will allow us to disambiguate between sets of Paths, Nodes, and Edges that we want to emit to an output graph versus those that we want to output directly to the user;
  2. It can be used to output more than one graph per pollen file.

I think @anshumanmohan's first comment also raises an interesting question about how to generate steps for an output graph. We can force users to simply add steps to an ordinary steps': Set<Step>, and then just emit { node with steps = steps' for each node in the graph. I don't know of any odgi functions which compute each element of steps' in parallel, so perhaps this solution is sufficient for most graph analyses. I think the alternative would be to allow users emit steps to step' and then also use step' as a first-class value, but as discussed today this is more of a stretch goal and I think we should stick with forcing steps' to be an ordinary Set of steps.

Of course, to fully support this, we probably want to extend pollen with support for creating and building Sets.

susan-garry commented 1 year ago

Sticking to the original intend of this issue, while we're in agreement on the use of emit-to statements, we hadn't agreed to a syntax for initializing sets. These are my prelimary ideas and thoughts as to what our options are:

  1. Don't. That is, a user isn't required to initialize the output sets. Instead, a user can simply write emit e to s and the compiler will infer that it needs to allocate space for the set s and emit objects to it. In this case, we could
    1. Use a naming scheme to determine which sets are part of output graphs and which output graph they belong to, or
    2. Bad idea as discussed in https://github.com/cucapra/pollen/issues/57#issuecomment-1517196414 Extend the language with emit e to s in g statements, where g is the name of an output graph (that, as with s, perhaps doesn't even require initialization).
  2. Use special keyword(s). At the meeting today, Adrian proposed something like output set = OutputSet(Node); This seems fine, but we've also noted that in the future, we might not output every set that is emitted to. In the interest of backwards compatibility, then, I think that ideally the keyword should not be "output". So far I haven't come up with a better one, and would appreciate any suggestions! The best I can come up with is emittee set = Emittee(Node); or emitto set = Emitto(Node);

If we do use special keywords, I can think of three possible ways to specify if a set is part of an output graph:

  1. As above, use a naming scheme to determine which sets are part of output graphs and which output graph they belong to, or
  2. Bad idea; see (ii) above As above, extend the language with emit e to s in g statements, or
  3. Require that the user specify the (perhaps optional?) output graph as part of initializing a set. This could mean
    1. Initializing the output graph first, as in outgraph g1; emitto s = Emitto(Node, g1);, or
    2. Not initializing output graphs and simply naming them on the fly, as we do with sets in Option 1: emitto s1 = Emitto(int); emitto s2 = Emitto(Node, graph1);
    3. Bad idea, too convoluted Numbering the output graph, with some default like -1 or 0 to indicate that a set doesn't belong to an output graph. For example: emitto s1 = Emitto(Node, -1); emitto s2 = Emitto(Node, 1);

I think one downside to not initializing the sets at all is that in the future, we might want to allow their variable identifiers to be used as first class values in specific instances, such as computing a set of steps in parallel and then outputting a node with the new set of steps as described in my last comment. In these cases, it seems somewhat unintuitive to use emit-to statements to also initialize these sets. However, the current alternatives aren't exactly the most intuitive either; one drawback of the keyword examples above is that it looks a bit like initializing a variable whose type comes before its identifier.

Does anyone have thoughts about these options, or ideas for other possibilities?

anshumanmohan commented 1 year ago

Looks like this spawned two separate issues, so I've created a separate issue for us to discuss the need for Set. Let's keep this thread for the other issue.

anshumanmohan commented 1 year ago

If you want a word other than output, how about transmittable? As in, capable of being transmitted, but not presumed to be. And then you can transmit a thing that was previously declared transmittable. It's a long word, but ah well. No more output...

anshumanmohan commented 1 year ago

My vote would have been not to require the initialization of output sets. However, this will probably limit us when it comes to multiple output sets of the same flavor.

Then we're left with special keywords.

  1. Naming Scheme: feels like it would little the code with lots of variables that the user will need to be rather careful about all the time.
  2. Extend with emit e to s in g statements: same concern.
  3. User specifies the output graph when initializing the set: this is my preference; I'll say more below.

I think if they specify no graph, they are creating one graph. If they want multiple, they must name them all. Here I'm sorta channeling https://en.wikipedia.org/wiki/Zero_one_infinity_rule.

Separately, if they want to transmit their transmittable, they must say so as suggested above. I think this takes care of your 0/1 flags in option (iii).

Finally, I'm cool with either (i) or (ii) stylistically! Slight preference for (i), but not a big deal.

I like this style versus the others because it's like setting up an outchannel to write to a file. You specify which file, once, and get a handle to the outchannel. Then you can pass that handle around with the knowledge that it is piped to the right place.

susan-garry commented 1 year ago

My vote would have been not to require the initialization of output sets. However, this will probably limit us when it comes to multiple output sets of the same flavor.

Maybe it would be more accurate to describe the first option as initializing sets as part of the emit-to statements. So we would still name each output set, but instead of

emitto s1 = Emitto(int);
emittos2 = Emitto(int);
emit 4 to s1;
emit 2 to s2;

We would get rid of the first two lines and just say

emit 4 to s1;
emit 2 to s2;

The compiler would then infer that s1 and s2 are different sets, allocate separate space for each, and place 4 in s1 and 2 in s2.

anshumanmohan commented 1 year ago

Right, but I think this would get messy quickly once we got into multiple graphs. You'd be saying

emit foo to s1 in g1;
emit bar to s2 in g2;

and then accidentally saying

emit baz to s1 in g2;

would set off the alarms.

The option of yours that I am most fond of avoids this: you set up the connections between s1 and g1, respectively s2 and g2, once, and then you get to add stuff to s1 or s2 with the knowledge that they are going to the right places.

outgraph g1; 
outgraph g2; 
s1 = Transmittable(Node, g1);
s2 = Transmittable(Node, g2);
// setup done

emit foo to s1;
emit bar to s2;
emit baz to s1;
...

transmit s1
// and not s2
susan-garry commented 1 year ago

In hindsight, I do think emit e to s in g is weird because each set belongs to a fixed set of graphs, so it doesn't make sense to re-declare what this graph is every time we emit to the set.

Separately, if they want to transmit their transmittable, they must say so as suggested https://github.com/cucapra/pollen/issues/57#issuecomment-1517170318. I think this takes care of your 0/1 flags in option (iii).

I agree that 0/1 flags would not be an issue in the case, and that's definitely a good thing. I'm tempted to remove all of the options that you've already showed to be demonstrably bad to make less reading for the next person who dares enter this discussion thread.

--- Mostly Resolved ---

I think if they specify no graph, they are creating one graph. If they want multiple, they must specify them all. Here I'm sorta channeling https://en.wikipedia.org/wiki/Zero_one_infinity_rule.

I'm still a bit confused on what you are suggesting here. I think that perhaps what you're suggesting is that 'special sets' containing Nodes, Paths, or Edges be automatically emitted to an output graph, and that a user must specify when they instead want such a set to just be outputted to the user. I just don't think we should introduce new syntax for this. Since we already need a special syntax for emitting to more than one output graph, it makes sense to me that we can just reuse this syntax to describe emitting to a single output graph, rather than invent new syntax to describe when we're not emitting to any graph. Does this make sense, or have I completely misunderstood your point?

Finally, I'm cool with either (i) or (ii) stylistically! Slight preference for (i), but not a big deal.

I like this style versus the others because it's like setting up an outchannel to write to a file. You specify which file, once, and get a handle to the outchannel. Then you can pass that handle around with the knowledge that it is piped to the right place.

Makes a lot of sense to me! You've convinced me on this one.

anshumanmohan commented 1 year ago

Sounds good Susan! I was just trying to get you a way to craft a set without, by default, outputting it. I actually tweaked my quoted comment for clarity, I think while you were writing your note above. Could you see if the current version looks any better?

susan-garry commented 1 year ago

I think that makes a bit more sense - I'm not sure what exactly your suggestion is for specifying when a set should be outputted to the user. I think that's more of a concern for the next version of pollen, but I assumed we would probably just use return statements, perhaps restricting them so that they can only occur at the end of a function (and therefore not within any loops).

sampsyo commented 1 year ago

Sounds good, y'all! To summarize what I think the main options are, I think we have:

  1. Implicit: To determine the set of output sets, the compiler looks for all emit expr to s statements, and for each distinct s, creates an output set.
  2. Explicit: Before a user can do emit expr to s, the user has to do something like out s(type) for some element type type to declare the name s.

There is a separate issue about how to associate sets with output graphs, but that's both a later-phase thing and somewhat (although not entirely) orthogonal. The explicit option would make this particularly easy to handle.

I didn't quite grok which of these we've settled on. I think it's 1 (implicit)—is that right? (TBH, both seem good to me!)

susan-garry commented 1 year ago

I think we've agreed that the explicit option is the way to go, as this makes it clearer what variable names are floating around in the program environment and will make it easier to associate sets with output graphs when we get there.

I think the syntax that I prefer the most is graph g; parset s[type, graph]; or parset s: type, graph, by process of elimination. Something like outset s = Outset(Type) looks too much like an object initialization, with outset as the type of s, but these sets don't have an explicit type. Additionally, the outset keyword doesn't feel quite right because it implies that s will be output, but in the future we might want to allow emitting to sets that aren't output. In this case, the outset keyword would be misleading, and if we change it then we lose backward compatibility. I also just like "parset" because it sounds like "parsec," which hopefully is more charming than confusing.

sampsyo commented 1 year ago

All sounds good! Just to throw them out there, other keyword options include set and out. But this all seems good to me!

anshumanmohan commented 1 year ago

@susan-garry if we are happy with this, shall we close it out and make it a discussion?