feature wish/discussion: having complex size as observable

yarden commented 5 years ago

This is loosely related to #599 and something I've discussed briefly with @pirbo in the past. I'm putting it here for posterity/discussion.

There are some non-local properties that would be extremely useful to have as observables, such as complex size. Imagine a Kappa program that produces polymers (like this one). To measure polymer sizes through time, one has to either: (1) dump many snapshots and parse them, or (2) dump a trace and query it. Both are generally expensive in time and storage. It'd be very handy to be able to say something like:

%obs: 'poly_sizes' sizes(C(x[.],y))

I imagine sizes(expr) to mean something like: get the sizes (a vector) of every complex that matches expr. In the output (data.csv), this entry could appear as a list, e.g.:

[T]   'poly_sizes'
0.0   [1,1,1,1]
0.1   [1,2,1]
0.2   [1,3]
0.3   [1,1,2]
0.4   [1,1,1,1]
...

(I realize it's a bit ugly, but csv is an ugly format anyway...)

My understanding is that this would be highly non-trivial to implement because it requires keeping track of a non-local state of the mixture. But I wonder if there's a way to restrict expr in a way that would make it easier?
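To make the "non-local state" point concrete, here is a minimal sketch of size tracking with a union-find structure. It is only an illustration, not how KaSim stores the mixture; the class name `ComplexSizes` and its methods are made up for this example. Binding is cheap to track this way, but union-find has no efficient "unbind": after a bond breaks, one would have to re-traverse the complex to see whether it split, which is where the cost comes from.

```python
class ComplexSizes:
    """Hypothetical tracker: complex sizes under binding events only."""

    def __init__(self, n_agents):
        # Every agent starts out as its own complex of size 1.
        self.parent = list(range(n_agents))
        self.size = [1] * n_agents  # meaningful only at root indices

    def find(self, a):
        # Walk up to the root, halving the path along the way.
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def bind(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return  # bond inside one complex (ring closure): sizes unchanged
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller complex to the larger one
        self.size[ra] += self.size[rb]

    def sizes(self):
        # One entry per complex, i.e. per root.
        roots = (r for r in range(len(self.parent)) if self.parent[r] == r)
        return sorted(self.size[r] for r in roots)


tracker = ComplexSizes(4)
print(tracker.sizes())  # [1, 1, 1, 1]
tracker.bind(0, 1)
print(tracker.sizes())  # [1, 1, 2]
tracker.bind(1, 2)
print(tracker.sizes())  # [1, 3]
```

Anything that also has to handle unbinding needs either a traversal on every bond break or a more elaborate dynamic-connectivity structure.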

The trace query language can presumably do this, but this query seems common enough that it'd be useful to have as built-in, without needing to output traces.

Thanks, Yarden

hmedina commented 5 years ago

I would argue there are two types of measure here, one is local and another is global.

At the local level: When you ask the mixture "how many patterns X are there", in the form of an observable, you get the number of embeddings. This is not a global answer: it doesn't account for symmetries. If X has symmetries (other than the trivial one), e.g. Bob(site[1]), Bob(site[1]), the observable will report the number of times that embedding happened, which is twice the number of Bob dimers in the mixture. Thus, posing this query as an observable can yield very nasty responses. Querying size(Bob()), which is to mean "the size, in agents, of any graph that contains agent Bob()", when the mixture contains a Bob-homodecamer would yield [10, 10, 10, 10, 10, 10, 10, 10, 10, 10] for that decamer alone, which is not what (I imagine) the user wants or expects. In fact, disentangling such a response quickly becomes non-trivial when one has multiple agent types. Imagine I have a Bob-Mary system where Bob and Mary can homopolymerize, and also co-polymerize with each other, and my size(Bob()) vector looks like [3, 3, 3, 3]; is there a Bob-homotrimer and a Bob-Mary-heterotrimer with one Bob? Or maybe four Bob-Mary-heterotrimers with one Bob each?!
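The ambiguity can be reproduced with a toy mixture. The sketch below assumes a made-up representation (a dict of agent id to agent type, plus a list of bonded id pairs); none of these names come from KaSim. It contrasts a per-embedding answer (one entry per matching Bob) with a per-complex answer (one entry per complex containing a Bob).

```python
from collections import Counter, defaultdict


def label_complexes(agents, bonds):
    """Return {agent_id: complex_label} using a depth-first traversal."""
    adjacency = defaultdict(list)
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    label = {}
    for start in agents:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in label:
                    label[neighbour] = start
                    stack.append(neighbour)
    return label


def sizes_per_embedding(agents, bonds, agent_type):
    # One entry per agent of the requested type (what an embedding count sees).
    label = label_complexes(agents, bonds)
    size = Counter(label.values())
    return [size[label[a]] for a, kind in agents.items() if kind == agent_type]


def sizes_per_complex(agents, bonds, agent_type):
    # One entry per complex that contains at least one such agent.
    label = label_complexes(agents, bonds)
    size = Counter(label.values())
    hits = {label[a] for a, kind in agents.items() if kind == agent_type}
    return sorted(size[lab] for lab in hits)


# Mixture A: one Bob-homotrimer plus one Bob-Mary-Mary heterotrimer.
agents_a = {0: "Bob", 1: "Bob", 2: "Bob", 3: "Bob", 4: "Mary", 5: "Mary"}
bonds_a = [(0, 1), (1, 2), (3, 4), (4, 5)]

# Mixture B: four Bob-Mary-Mary heterotrimers.
agents_b, bonds_b = {}, []
for k in range(4):
    base = 3 * k
    agents_b.update({base: "Bob", base + 1: "Mary", base + 2: "Mary"})
    bonds_b += [(base, base + 1), (base + 1, base + 2)]

print(sizes_per_embedding(agents_a, bonds_a, "Bob"))  # [3, 3, 3, 3]
print(sizes_per_embedding(agents_b, bonds_b, "Bob"))  # [3, 3, 3, 3]
print(sizes_per_complex(agents_a, bonds_a, "Bob"))    # [3, 3]
print(sizes_per_complex(agents_b, bonds_b, "Bob"))    # [3, 3, 3, 3]
```

The two mixtures are indistinguishable in the per-embedding vector, while the per-complex vector separates them.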

To me, the question only makes sense at the state level: Dump the state as a snapshot, and you force the realization of every agent, every complex, every internal and bond state, and by nature of writing to file, to stream, to a linear thing, you force the symmetry break. Stuff appears in order. Now the question of "the size, in agents, of any complex that contains agent Bob()" is easy to answer. I do it with KaSaAn frequently ;) You can dump a snapshot with the same frequency of plot points and get the global pictures at those times, getting the expected functionality as if it were a magic observable. Or if you want omniscient powers, to know all about everything at every time, the trace can be queried (assuming it fits on disk... see #599).

Having state-level information be reported in the data.csv file would be of some use, don't get me wrong, especially when the user doesn't want to trouble themselves with parsing snapshots (e.g. for quick prototyping). But it would not be an observable, and I would argue it should not be called one. Inventing a new kind of datum to report, to be taken at "plotPeriod" and written to the time-traces file, therefore requires IMHO a new keyword, say %rep: for "report". And what else should be reported in addition to state-level size of complexes containing some pattern? We can take the function list from the TQL. For example, report back on "the size, in Mary(state{some}), of any complex containing Bob()", or "the Kappa expression of all complexes containing Bob(state{other})". With proper quoting, the CSV commas shouldn't clash with the Kappa expression's.
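On the quoting point, a small sketch with Python's standard csv module shows that fields containing commas survive a round trip when they are quoted. The column names and the row contents are invented for illustration; this says nothing about how KaSim actually writes data.csv.

```python
import csv
import io
import json


def write_report(rows):
    """Serialize rows to CSV text; fields containing commas get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL
    writer.writerow(["[T]", "poly_sizes", "complexes"])
    for time, sizes, expression in rows:
        # json.dumps keeps the size vector in a single, parseable field.
        writer.writerow([time, json.dumps(sizes), expression])
    return buffer.getvalue()


def read_report(text):
    """Parse the CSV text back into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text)))


text = write_report([(0.1, [1, 2, 1], "Bob(site[1]), Bob(site[1])")])
parsed = read_report(text)
print(parsed[1])                 # ['0.1', '[1, 2, 1]', 'Bob(site[1]), Bob(site[1])']
print(json.loads(parsed[1][1]))  # [1, 2, 1]
```

Any reader that honours standard CSV quoting gets three fields back per row, even though both the size vector and the Kappa expression contain commas.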

Ultimately, all this functionality already exists with tools available outside KaSim. Would it be nice to have all that be present within KaSim itself? Yes it would, especially as a quality-of-modeler-life improvement. Would it be more reliable to have all that external code migrate into KaSim so the KaSim-Dev gets more projects to maintain and those external repos lose their volunteer maintainers? [insert French expletive here]

Anyhow, those are my 2 cents on this discussion.

P.S. As a discussion, shouldn't this be on the Kappa-Users forum?

yarden commented 5 years ago

> To me, the question only makes sense at the state level: Dump the state as a snapshot, and you force the realization of every agent, every complex, every internal and bond state, and by nature of writing to file, to stream, to a linear thing, you force the symmetry break. Stuff appears in order. Now the question of "the size, in agents, of any complex that contains agent Bob()" is easy to answer. I do it with KaSaAn frequently ;) You can dump a snapshot with the same frequency of plot points and get the global pictures at those times, getting the expected functionality as if it were a magic observable.

I do this routinely as well by dumping snapshots, but this is precisely what my proposal is meant to avoid, because dumping the state is hardly magical ;). Dumping a snapshot for every plot point is impractical in many cases. I run Kappa simulations that produce millions of events, so to observe the system at every event, Kappa would have to output millions of separate JSON files (which, as we discussed, are large), which I would then need to postprocess. However, if all we're interested in is length, keeping track of the length of some complexes and outputting a vector with a million numbers is much easier to work with. The snapshot sometimes contains way more information than the user wants.

Kappa-Dev / KappaTools

feature wish/discussion: having complex size as observable #600