Closed susan-garry closed 1 year ago
On second thought, if we're exposing these special sets to users anyways, might it not be easier to just allow users to write
ParSet<int> s;
s.add(5);
return s;
?
Thanks Susan!
If this is beginning to look necessary, I'm all for it. I guess I'd just like to state the approach that I was hoping would work (which the above will reject/subsume).
Basically I was hoping that the to
of your emit-to
could always be inferred from context/typing.
Slipping into pseudocode, slash buggy Pollen, a little...
1. for node in nodes:
2. .
3. for step in node.steps:
4. step' = ...
5. emit step'
6. .
7. .
8. .
I've added line numbers and blank lines for the spots I want to talk about.
node'
and emit step'
into that node's step set. By a similar argument I will also set up an under-construction graph'
.emit node'
, which will go into the node set of graph'
.graph'
.Just to tie this off, looks like you totally have the green light on this!
I'll just write out the reservation @sampsyo had about making such output sets first-class values. We need to make it impossible for them to be parts of ordinary expressions, if-guards, reassignments, etc. Allowing this would open up a whole pandora's box. Consider:
output out_nodes(Node)
output out_nodes_evil(Node)
for node in graph.nodes:
if node.seq == "GATTACA":
out_nodes := out_nodes_evil
out_nodes.add(node)
Clearly this is a mess.
This is not to say that these output sets cannot appear internally. For instance:
output out_nodes_short(Node)
output out_nodes_long(Node)
for node in graph.nodes:
if len(node.seq) < 10:
out_nodes_short.add(node)
else:
out_nodes_long.add(node)
Thank you all for your input! We had a productive discussion at today's meeting where we agreed that emit-to
statements are the way to go, specifically because
Paths
, Nodes
, and Edges
that we want to emit to an output graph versus those that we want to output directly to the user;I think @anshumanmohan's first comment also raises an interesting question about how to generate steps for an output graph. We can force users to simply add steps to an ordinary steps': Set<Step>
, and then just emit { node with steps = steps'
for each node in the graph. I don't know of any odgi functions which compute each element of steps'
in parallel, so perhaps this solution is sufficient for most graph analyses. I think the alternative would be to allow users emit steps to step'
and then also use step'
as a first-class value, but as discussed today this is more of a stretch goal and I think we should stick with forcing steps'
to be an ordinary Set
of steps.
Of course, to fully support this, we probably want to extend pollen with support for creating and building Set
s.
Sticking to the original intend of this issue, while we're in agreement on the use of emit-to
statements, we hadn't agreed to a syntax for initializing sets. These are my prelimary ideas and thoughts as to what our options are:
output set = OutputSet(Node);
This seems fine, but we've also noted that in the future, we might not output every set that is emitted to. In the interest of backwards compatibility, then, I think that ideally the keyword should not be "output". So far I haven't come up with a better one, and would appreciate any suggestions! The best I can come up with is
emittee set = Emittee(Node);
or
emitto set = Emitto(Node);
If we do use special keywords, I can think of three possible ways to specify if a set is part of an output graph:
outgraph g1; emitto s = Emitto(Node, g1);
, oremitto s1 = Emitto(int); emitto s2 = Emitto(Node, graph1);
emitto s1 = Emitto(Node, -1); emitto s2 = Emitto(Node, 1);
I think one downside to not initializing the sets at all is that in the future, we might want to allow their variable identifiers to be used as first class values in specific instances, such as computing a set of steps in parallel and then outputting a node with the new set of steps as described in my last comment. In these cases, it seems somewhat unintuitive to use emit-to
statements to also initialize these sets. However, the current alternatives aren't exactly the most intuitive either; one drawback of the keyword examples above is that it looks a bit like initializing a variable whose type comes before its identifier.
Does anyone have thoughts about these options, or ideas for other possibilities?
Looks like this spawned two separate issues, so I've created a separate issue for us to discuss the need for Set
. Let's keep this thread for the other issue.
If you want a word other than output
, how about transmittable
? As in, capable of being transmitted, but not presumed to be. And then you can transmit
a thing that was previously declared transmittable
. It's a long word, but ah well. No more output
...
My vote would have been not to require the initialization of output sets. However, this will probably limit us when it comes to multiple output sets of the same flavor.
Then we're left with special keywords.
emit e to s in g
statements: same concern.I think if they specify no graph, they are creating one graph. If they want multiple, they must name them all. Here I'm sorta channeling https://en.wikipedia.org/wiki/Zero_one_infinity_rule.
Separately, if they want to transmit their transmittable
, they must say so as suggested above. I think this takes care of your 0/1 flags in option (iii).
Finally, I'm cool with either (i) or (ii) stylistically! Slight preference for (i), but not a big deal.
I like this style versus the others because it's like setting up an outchannel
to write to a file. You specify which file, once, and get a handle to the outchannel. Then you can pass that handle around with the knowledge that it is piped to the right place.
My vote would have been not to require the initialization of output sets. However, this will probably limit us when it comes to multiple output sets of the same flavor.
Maybe it would be more accurate to describe the first option as initializing sets as part of the emit-to
statements. So we would still name each output set, but instead of
emitto s1 = Emitto(int);
emittos2 = Emitto(int);
emit 4 to s1;
emit 2 to s2;
We would get rid of the first two lines and just say
emit 4 to s1;
emit 2 to s2;
The compiler would then infer that s1
and s2
are different sets, allocate separate space for each, and place 4 in s1
and 2 in s2
.
Right, but I think this would get messy quickly once we got into multiple graphs. You'd be saying
emit foo to s1 in g1;
emit bar to s2 in g2;
and then accidentally saying
emit baz to s1 in g2;
would set off the alarms.
The option of yours that I am most fond of avoids this: you set up the connections between s1
and g1
, respectively s2
and g2
, once, and then you get to add stuff to s1
or s2
with the knowledge that they are going to the right places.
outgraph g1;
outgraph g2;
s1 = Transmittable(Node, g1);
s2 = Transmittable(Node, g2);
// setup done
emit foo to s1;
emit bar to s2;
emit baz to s1;
...
transmit s1
// and not s2
In hindsight, I do think emit e to s in g
is weird because each set belongs to a fixed set of graphs, so it doesn't make sense to re-declare what this graph is every time we emit to the set.
Separately, if they want to transmit their transmittable, they must say so as suggested https://github.com/cucapra/pollen/issues/57#issuecomment-1517170318. I think this takes care of your 0/1 flags in option (iii).
I agree that 0/1 flags would not be an issue in the case, and that's definitely a good thing. I'm tempted to remove all of the options that you've already showed to be demonstrably bad to make less reading for the next person who dares enter this discussion thread.
--- Mostly Resolved ---
I think if they specify no graph, they are creating one graph. If they want multiple, they must specify them all. Here I'm sorta channeling https://en.wikipedia.org/wiki/Zero_one_infinity_rule.
I'm still a bit confused on what you are suggesting here. I think that perhaps what you're suggesting is that 'special sets' containing Nodes, Paths, or Edges be automatically emitted to an output graph, and that a user must specify when they instead want such a set to just be outputted to the user. I just don't think we should introduce new syntax for this. Since we already need a special syntax for emitting to more than one output graph, it makes sense to me that we can just reuse this syntax to describe emitting to a single output graph, rather than invent new syntax to describe when we're not emitting to any graph. Does this make sense, or have I completely misunderstood your point?
Finally, I'm cool with either (i) or (ii) stylistically! Slight preference for (i), but not a big deal.
I like this style versus the others because it's like setting up an
outchannel
to write to a file. You specify which file, once, and get a handle to the outchannel. Then you can pass that handle around with the knowledge that it is piped to the right place.
Makes a lot of sense to me! You've convinced me on this one.
Sounds good Susan! I was just trying to get you a way to craft a set without, by default, outputting it. I actually tweaked my quoted comment for clarity, I think while you were writing your note above. Could you see if the current version looks any better?
I think that makes a bit more sense - I'm not sure what exactly your suggestion is for specifying when a set should be outputted to the user. I think that's more of a concern for the next version of pollen, but I assumed we would probably just use return statements, perhaps restricting them so that they can only occur at the end of a function (and therefore not within any loops).
Sounds good, y'all! To summarize what I think the main options are, I think we have:
emit expr to s
statements, and for each distinct s
, creates an output set.emit expr to s
, the user has to do something like out s(type)
for some element type type
to declare the name s
.There is a separate issue about how to associate sets with output graphs, but that's both a later-phase thing and somewhat (although not entirely) orthogonal. The explicit option would make this particularly easy to handle.
I didn't quite grok which of these we've settled on. I think it's 1 (implicit)—is that right? (TBH, both seem good to me!)
I think we've agreed that the explicit option is the way to go, as this makes it clearer what variable names are floating around in the program environment and will make it easier to associate sets with output graphs when we get there.
I think the syntax that I prefer the most is graph g; parset s[type, graph];
or parset s: type, graph
, by process of elimination. Something like outset s = Outset(Type)
looks too much like an object initialization, with outset
as the type of s
, but these sets don't have an explicit type. Additionally, the outset
keyword doesn't feel quite right because it implies that s
will be output, but in the future we might want to allow emitting to sets that aren't output. In this case, the outset
keyword would be misleading, and if we change it then we lose backward compatibility. I also just like "parset" because it sounds like "parsec," which hopefully is more charming than confusing.
All sounds good! Just to throw them out there, other keyword options include set
and out
. But this all seems good to me!
@susan-garry if we are happy with this, shall we close it out and make it a discussion?
Currently, the emit statement as we've been using it is ambiguous.
emit e
addse
to a special set whose interface is restricted such that we can safely compute its elements independently (in theory - but that's a separate issue). However, it's not clear what set we are addinge
to. If we want to simply output one set of values then this syntax is unambiguous because there is only one possible set that we could emite
to, but if we want to emit multiple sets or output entirely new graphs, things get a little complicated. Each graph consists of multiple sets to begin with, so we want a way to alter, e.g.,node1.steps
andnode2.steps
, we need a way to emit to both of these sets. We also want a way to distinguish between emitting, say, a node to an output set vs. emitting a node to a new graph.I think the easiest way to deal with this ambiguity is to require the user to specify which set they are emitting a value to, such as with
emit-to
statements:emit e to s
. This means users must also be able to initialize special sets likes
. Because these aren't ordinary sets and we may want to support ordinary sets further down the line, I think we should probably give these sets a unique name and type, such asParSet<>
orPolSet<>
.There is still the issue of differentiating between when a set is being output directly to the user or modifying an existing graph. I suspect that the easiest way to do this will be to define a specific nomenclature for
ParSet
s that modify an existing graph, e.g.node_output
is the set of nodes in our new graph and if a user writesemit e to node_output
, we know that we will be emitting a new graph. This nomenclature could be extended to allow a user to output multiple modified graphs by simply indexing these names, sonode_output1
andnode_output2
would each modify the original graph and produce a new one.I'm admittedly not a huge fan of defining a special nomenclature since these variable names aren't necessarily intuitive to everyone and it would be better if we could allow users to define their own variable names in as many instances as possible, but the main point of this issue is that emit statements are ambiguous and too restrictive to support everything that we want
Pollen
to. We could supportemit
statements in addition toemit-to
statements, but I think this just creates further complications and it makes the most sense to me to start withemit-to
statements and addemit
statements later on if we decide it would be beneficial.