Data collection interface inspired by DataFrames.jl

JuliaDynamics / Agents.jl

Agent-based modeling framework in Julia

https://juliadynamics.github.io/Agents.jl/stable/

MIT License

763 stars 125 forks source link

Data collection interface inspired by DataFrames.jl #408

Closed flipgthb closed 3 years ago

flipgthb commented 3 years ago

Hi, First, I would like to say that I appreciate the work and that I'm way more comfortable with Agents.jl than with Mesa or other approaches I've tried for agent modeling, so thanks a lot for the great job. One thing, however, that I think could be improved is the data collection interface. Not that it is bad right now, just that think it could be even more convenient. One approach I really appreciate regarding data manipulation is the DataFrames.jl package interface to pass functions to apply to dataframes columns, like:

using DataFrames
df = DataFrame(a=collect(1:6),b=rand([:x,:y],6),c=randn(6))
    select(df,[:a,:b]=>ByRow((a,b)->"$(b)_$(a)")=>:new_col_from_a_and_b)

Two features of this approach that, in my opinion, Agents.jl approach lacks are:

The possibility to use selectors to specify how to use the attributes (in the above example the ByRow).
The ability to use anonymous functions and give it a name on the fly, which helps to clean the global namespace from functions that will be used only once.

I propose for Agents.jl to adapt its data collection interface to follow the approach of DataFrames.jl. Sure, Agents.AgentBasedModel does not follow the Tables.jl interface, so the "selectors" should be different.

The idea looks somewhat like this:

run!(
    fake_model_instance,
    fake_agent_step!,
    n_steps;
    report=[
    [:agent_property,:agent_other_property]=>ByAgent((x,y)->x+y)=>:agent_complicated_property,
    :agent_property=>Neighbors(((a1prop,a2prop)->max(a1prop,a2prop)),r=1)=>:pairwise_property,
    ],
    report_when=[:agent_complicated_property=>x->mean(x)<1.0]
)

Forgive me if what I said is to much nonsense (not a programmer and never requested a feature before).

Libbum commented 3 years ago

Hi Felippe!

An interesting proposal, although I'm not convinced yet that it makes the most sense for our collection model. Granted, there are a lot of 'simple' examples in our documentation, so we have yet to really considered scenarios that may require greater flexibility. If you have one, it would be helpful to take a look at it.

Some things we need to decide upon:

Does this change keep the interface simple when needed but give power users the capacity they need? At the moment I cannot see how I just store :agent_property directly for example.
Can this be implemented by exposing more of the DataFrames ecosystem or do we need to re-implement it? I've personally never used some of this syntax, so I'd need to do some reading into the API to be sure.
Are there performance issues we need to consider? We've done a lot of benchmarking on the current methods and they're pretty fast and lean. How will these changes affect that. I think most of what you're suggesting is just syntax changes, more on that below.
I'm quite confused about the Neighbors example. How should I interpret this? It looks like you take some property, search for distances between two copies of that property and find the maximum? It seems like a lot of effort and a lot of syntax to learn if you're unfamiliar.

Those are some open questions. To address your two points though, I'm not sure they are exactly correct.

In the ByRow example, the select method is done on an existing dataframe. Since the current run! returns dataframes—this action is currently possible to do as a post-process.

The anonymous function & global namespace issue is something that I really didn't like myself and we had extensive debates about it in #191 when the latest re-design happened. I was advocating for a dictionary option, but that caused problems of its own. As a solution to this annoyance, its possible to wrap the collect calls in a function:

function assets(model)
    trees(model) = model.orchard.apple_trees + model.orchard.peach_trees                                                                                                  
    cows(model) = model.farm.cows                                                                                                                                         
    tractors(model) = model.farm.tractors                                                                                                                                 
    cash(model) = model.bank_balance - model.loans
    [trees, cows, tractors, cash]
end

julia> Agents.collect_model_data(model, assets(model), 1)
1×5 DataFrame
│ Row │ trees │ cows  │ tractors │ cash    │ step  │
│     │ Int64 │ Int64 │ Int64    │ Float64 │ Int64 │
├─────┼───────┼───────┼──────────┼─────────┼───────┤
│ 1   │ 884   │ 242   │ 3        │ 69000.7 │ 1     │

As a consequence of such a syntax, it looks like the only extra thing that's not addressed yet is report_when. This functionality is already accessible by the when keyword of run!.

From a purely programmatic point of view, all of this is syntax sugar. Since the model only stores the current time step and is mutated every call to run!, the only way we can actually build the dataframe is one row at a time. Actions like [:agent_property,:agent_other_property]=>ByAgent((x,y)->x+y)=>:agent_complicated_property need to be converted to agent.x + agent.y then pushed into the :agent_complicated_property column. The suggested syntax has a lot of doubling up, how is it more expressive & capable than agent_complicated_property(a) = a.x + a.y?

I totally forgot about this assets function until now—it should have been put into an example when we were finalising the collection methods, so at the very least that should be added to the docs. From the looks of it though, we may already have the capabilities you're looking for, just not explained enough through appropriate examples and documentation.

On the other hand, if users are well versed in DataFrames, then perhaps they would welcome the same syntax. I'm open to persuasion, but perhaps this makes our current implementation a little clearer to you.

flipgthb commented 3 years ago

Regarding your four points:

Does this change keep the interface simple when needed but give power users the capacity they need? At the moment I cannot see how I just store :agent_property directly for example.

If possible either to leverage or mimic the DataFrames.jl interface then possibly yes, as the it accepts column names and functions to process the selections

Can this be implemented by exposing more of the DataFrames ecosystem or do we need to re-implement it? I've personally never used some of this syntax, so I'd need to do some reading into the API to be sure.

Probably not, since the selectors serve as high level tools for manipulating the Tables.jl interface. Maybe if we could "view" an AgentBasedModel as a table, but I don't think treating ABMs as tables would be intuitive. Thats why I suggested different selectors: with the current interface, I need either to save my agents properties as whole and compute the quantities I'm really interest at the analysis level, using the DataFrames.jl (or whatever can deal with a DataFrame), or I need to implement many functions to collect only what I'm interested in each simulation (I'll try to illustrate it better below)

Are there performance issues we need to consider? We've done a lot of benchmarking on the current methods and they're pretty fast and lean. How will these changes affect that. I think most of what you're suggesting is just syntax changes, more on that below.

Honestly, I don't know about performance at that level, but I have mixed feeling about this: one one hand for me, the user of Agents.jl with no competence to build optimized functions to collect data, if the selectors (like the illustrative ByAgent and Neighbors) could serve as optimized building blocks for data collection I would be very happy. On the other hand I know this is selfish from my part, since I'm moving the job of optimizing a more general and hard problem to you, instead of learning myself. So, it would be just syntax changes, but a kind of DSL for data collection

I'm quite confused about the Neighbors example. How should I interpret this? It looks like you take some property, search for distances between two copies of that property and find the maximum? It seems like a lot of effort and a lot of syntax to learn if you're unfamiliar.

When I suggested the Neighbors I thought of it as an encapsulated loop over neighbors that applies the function, this case max between for the agents properties passed in the list of keys (it should have been maximum in the example. numpy fingers...). My initial thoughts on selectors for agents models were to extract the core of data collection and expose it as syntax. For instance, I Statistical Physics, when we think about the order parameters to be collected from a model's state, we don't think [maximum(agent1.agent_property,agent2.agent_property) for agent1 in model, agent2 in nearby_agents(agent1,model)], but something like "what is maximum :agent_property in this each agent neighborhood". This ways of thinking require different abstraction, the former requires me to translate a microscopic (local to each agent) into a macroscopic (need to evaluate the for all agents), the later requires me to be able to access the model from an agents perspective, which is not possible in Agents.jl. The idea of the selectors for agents is, in my dumb uninformed imagination is to allow local reasoning were its is more intuitive and global reasoning when more appropriate.

The report_when part is just a leap into the possibility of specifying the data collection points not per adata or mdata but depending on the observable itself. I just threw it there with no proper thought about, sorry.

As a matter of fact I didn't know about the approach used in this assets function, I tested with a simplified version of my model and it works. When I first read it wasn't clear how to make it work with paramscan, since I don't have access to the model instance in, but I can just use the create a function assets() instead of assets(model) keyword following the same strategy, pass the result of it to the mdata in paramscan and works the same, at least in a simple test model. This effectively solves the global namespace issue part of my proposal.

I'll write another comment with a more concrete example related to my work, but now I agree that all I suggested seems only syntax and kinda pointless to pursue. Anyway, thanks for answering

flipgthb commented 3 years ago

Just to illustrate why I would wish more flexibility on data collection, I'll describe the type of model I'm working on. The agents in my model are simple neural networks, in principle they may have any architecture, but in my present case, they have "2 hidden layer with K and N hidden nodes, respectievely". Each layer is represented by a weight vector and a covariance matrix, which in my learning algorithm plays the role of self tuning learning rates. Something like this:

@agent NNAgent{K,N} GraphAgent begin
    w1::MVector{K,Float64}
    C1::MMatrix{K,K,Float64}
    w2::MVector{N,Float64}
    C2::MMatrix{N,N,Float64}
end

I'm interested in a few functions like similarity(agent1,agent2) = dot(agent1.C1*agent1.w1, agent2.C1*agent2.w1) and analogous for w2 and C2, or frustration(a1,a2,a3)=similarity(a1,a2)*similarity(a1,a3) + similarity(a2,a3)*similarity(a2,a1) ..., and simpler ones that are easily covered with the current data collection approach. Now, these examples of "higher order interactions" have to be either defined as model data functions or I need to collect all agent properties and compute those interactions with the full record of the states of the agents. My problem with the later approach is a matter of storage and speed, since it is, in principle, possible to collect each of these with the knowledge of both the agent and the model at runtime, which is not the case for agent data functions. My problem with the former is that lifting a function that work at the level of pairs or triples of agents to the whole model requires me to commit to symmetries that may not be respected by the dynamics of the agents and to allocate huge arrays every time I want to record. Since I need to scan quite a few parameters, including K and N, this quickly becomes unpractical and I have to restrict my record to just a few like, 10 in about 10000 steps. Thats why I wondered if a more flexible data collection interface is possible.

Libbum commented 3 years ago

So this looks like a really cool example, and perhaps a reason to move forward on something like you've suggested. At the moment we have a low and high level collection API, so there could be an opening for an 'advanced ' one.

I'm afraid I don't quite follow the specifics of what you've described however. Would it be possible for you to share a quick example that demonstrates how the current API causes these inefficiencies (if I could run this example as a test-bed, that would be great), and then how your methods discussed above might change that?

Datseris commented 3 years ago

run!(
    fake_model_instance,
    fake_agent_step!,
    n_steps;
    report=[
    [:agent_property,:agent_other_property]=>ByAgent((x,y)->x+y)=>:agent_complicated_property,
    :agent_property=>Neighbors(((a1prop,a2prop)->max(a1prop,a2prop)),r=1)=>:pairwise_property,
    ],
    report_when=[:agent_complicated_property=>x->mean(x)<1.0]
)

I just need to say that this does not appear intuitive to me. Not only there is a very deep nesting level in the commands, in striking contrast to the current interface that has a nesting level of at most 1 (as adata can only be a Vector{Tuple}), but it is not clear to me what many of the extra keywords should do. I don't think having such clunky interfaces is a step forwrads, after spending several months simplifying Agents.jl as much as possible.

I totally forgot about this assets function until now—it should have been put into an example when we were finalising the collection methods, so at the very least that should be added to the docs. From the looks of it though, we may already have the capabilities you're looking for, just not explained enough through appropriate examples and documentation.

Yes please, I actually had no idea this was possible, lol.

Maybe if we could "view" an AgentBasedModel as a table, but I don't think treating ABMs as tables would be intuitive.

I don't see how this makes sense. What I'd think intuitively that the columns of such a table would be the model-level properties. Problem is, what are the rows? There are no rows, because an ABM exists only at a given instant in time, and has no history (unless explicitly created by the user as a model level parameter).

I'm interested in a few functions like similarity(agent1,agent2) = dot(agent1.C1*agent1.w1, agent2.C1*agent2.w1) and analogous for w2 and C2, or frustration(a1,a2,a3)=similarity(a1,a2)*similarity(a1,a3) + similarity(a2,a3)*similarity(a2,a1) ... ,

@flipgthb What you present here is a very interesting model and would certainly make an interesting addition in our examples library. What is not yet clear to me is whether you are really interested in similarity(agent1,agent2), or the average of similarity(agent1,agent2) over all possible pairs of agent1, agent2. It makes a huge difference. If it is the latter, then I don't see why it is not possible to do it with the low-level data collection. If it is the former, then I don't see a way of avoiding allocating a huge matrix.

So to summarize the discussion on my on words, and correct me if I got it wrong. The gist is that we'd like to have another data aggregation possibility, that instead of aggregating over 1 agent, aggregates over 2 agents as input arguments to a given user function. This of course raises the question on whether the pairs used should be all-to-all, or some kind of nearest neighbors using nearby_agents. Probably the second approach is better, as it can also achieve all-to-all by passing Inf as the interaction radius.

flipgthb commented 3 years ago

All right, I'll try to illustrate better. Consider the following model:

using Agents, DataFrames, Distributions, LightGraphs, LinearAlgebra, StaticArrays

Phi(x) = cdf(Normal(0,1),x)

# pretty much the agent structure of the model I'm currently working on
@agent NNAgent{K,N} GraphAgent begin
    w::MVector{K,Float64}   # (w,C) represent a behovior
    C::MMatrix{K,K,Float64} #
    m::MVector{N,Float64}   # (m,V) represent a different
    V::MVector{N,Float64}   # behavior
end

# this is a slightly simplified version of the actual agent_step! 
# I am using in my model:
# A receiver agent gets an input vector and a classification for this
# vector from a emitter agent and learn as if the emitter was a teatcher
# in a supervised learning machine learning algorithm, except that the
# receiver considers the possibility that the emitter can make a mistake.
# So both the error probability assigned by the receiver to the emitter and
# the receiver weight vector change acordingly the current state of the receiver
# regarding the received information
function nnagent_step!(receiver_agent::NNAgent{K,N},model::AgentBasedModel{<:Agents.AbstractSpace,NNAgent{K,N}}) where {K,N}
    wC_input = normalize!(randn(K))
    emitter_id = rand([e for e in nearby_ids(receiver_agent.pos,model) if e!=receiver_agent.id])
    emitter_answer = sign(model[emitter_id].w ⋅ wC_input)
    receiver_wC_ir = receiver_agent.w ⋅ (wC_input.*emitter_answer) / sqrt(1 + wC_input ⋅ (receiver_agent.C*wC_input))
    receiver_mV_ir = receiver_agent.m[emitter_id] / sqrt(1 + receiver_agent.V[emitter_id])

    p_wC, p_mV = Phi(receiver_wC_ir), Phi(receiver_mV_ir)

    receiver_agent.w += (1-2p_mV)*wC_input*emitter_answer
    receiver_agent.m[emitter_id] += (1-p_wC)
end

# The (w,C) behavior is responsible for processing information from input vectors x,
# where w plays the role a weight vector in a binary classifier.
# I'm interested in how much agents i and j would respond similarly for different inputs,
# which can be done by looking at the cossine between a_i.w and a_j.w
overlap(a_i::A,a_j::A) where {A<:NNAgent} = normalize(a_i.w) ⋅ normalize(a_j.w)
# The (m,V) behavior is reponsible for estimating how much another agent errs about the inputs.
# The probability agent i assigns to agent j to err is the probit of a_i.m[a_j.id] scaled by
# i's uncertainty on its own assignment a_i.V[a_j.id].
# The inhibition/excitation is how agent i responds to information comming from j and are just
# monotonic transformations of p_ij
inhibition(a_i::A,a_j::A) where {A<:NNAgent} = 2*Phi(a_i.m[a_j.id]/sqrt(1 + a_i.V[a_j.id]))-1
excitation(a_i::A,a_j::A) where {A<:NNAgent} = -inhibition(a_i,a_j)
# The following order parameters serve to study the connection between error assinment and 
# input classification, to see if error assignment propagates through neightborhoods and
# such propagation is coherent with input classification
m_frustration(a_i::A,a_j::A,a_k::A) where {A<:NNAgent} = excitation(a_i,a_j)*excitation(a_j,a_k)*excitation(a_i,a_k)
w_frustration(a_i::A,a_j::A,a_k::A) where {A<:NNAgent} = overlap(a_i,a_j)*overlap(a_j,a_k)*overlap(a_i,a_k)
coherence(a_i::A,a_j::A) where {A<:NNAgent} = overlap(a_i,a_j)*(excitation(a_i,a_j) + excitation(a_j,a_i))/2
balance(a_i::A,a_j::A,a_k::A) where {A<:NNAgent} = coherence(a_i,a_j) + coherence(a_j,a_k) + coherence(a_i,a_k)

# simple init function: just choose random normalized vectors for w and m and identities for C and V
function init_model(; K, N)
    agents = [NNAgent{K,N}(i,i,normalize!(randn(K)),diagm(ones(K)),normalize!(randn(N)),ones(N))
              for i in 1:N]
    model = ABM(NNAgent{K,N},GraphSpace(complete_graph(N));
                scheduler=random_activation,
                properties=Dict(:step_prop=>0))

    add_agent_pos!.(agents,[model])

    return model
end

model_step!(model) = model.step_prop += 1

# just a small sanity check
adf = run!(init_model(;K=3,N=5),nnagent_step!,model_step!,2; adata=[:w,:m])[1]

The adf at the end returns what is expected, no problems.

Now, the problem is how to collect the order parameter functions above? Given the constraints imposed by the Agents.jl data collection interface, I need to "lift" these function to work on the model level:

function model_data_experiment_1()
    model_overlap(model) = [overlap(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_inhibition(model) = [inhibition(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_excitation(model) = [excitation(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_coherence(model) = [coherence(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_m_frustration(model) = [m_frustration(a_i,a_j,a_k) 
                                  for a_i in allagents(model), a_j in allagents(model), a_k in allagents(model)
                                  if (a_i.id != a_j.id && a_j.id != a_k.id && a_i.id != a_k.id)]
    model_w_frustration(model) = [w_frustration(a_i,a_j,a_k) 
                                  for a_i in allagents(model), a_j in allagents(model), a_k in allagents(model)
                                  if (a_i.id != a_j.id && a_j.id != a_k.id && a_i.id != a_k.id)]
    model_balance(model) = [balance(a_i,a_j,a_k) 
                                  for a_i in allagents(model), a_j in allagents(model), a_k in allagents(model)
                                  if (a_i.id != a_j.id && a_j.id != a_k.id && a_i.id != a_k.id)]

    return [model_overlap,model_inhibition,model_excitation,model_coherence,model_m_frustration,model_w_frustration,model_balance]
end

One problem of this approach is this approach is that I loose the agent id information, since the agents stored in a Dict:

[a.id for a in allagents(init_model(K=3,N=15))]

returns a random order for the agents.

I can't even automate the sorting because all I get is an array, and maybe different replicates for the same parameters scramble the agents in different orders. I rely on agent id for a clustering algorithm to be applied to each order parameter tensors.

This can be solved if I do something like this (omitting the other functions as the idea is the same)

function model_data_experiment_1_as_df()
    model_overlap(model) = DataFrame([(step_prop=model.step_prop,id=a_i.id,other_id=a_j.id, overlap_ij=overlap(a_i,a_j)) 
                            for a_i in allagents(model) for a_j in allagents(model)])

    return [model_overlap]
end

and concat the collected data I get a DataFrame

mdf = vcat(run!(init_model(;K=3,N=5),nnagent_step!,model_step!,2; mdata=model_data_experiment_1_as_df())[2].model_overlap...)

Or I need to record the states of each agent and "lift" these functions to work with the agent DataFrame. This can be done only if I carry the GraphSpace of the model as well. Sure, in the example above this is irrelevant since I used the a complete graph topology, but my plans include testing the model for different network topologies. This approach is difficult because I lose the Agents.jl interface to GraphSpace and also forces me to work at dataframes grouped by step. It is feasible, though. Another disadvantage of this approach is that I'll eventually run out of memory, because I need to maintain the agents state dataframes AND the huge tensors of interest.

I cannot collect just sample moments, like the average, because the whole point of this model is to study polarization, so I expect the averages to be zero both when the model polarized or if just stays random. Sure, I could measure higher sample moments, like the skew and kurtosis, but sample statistics throw away to much information.

At this point, I fell like I'm fighting Agents.jl to give me the info that was there during the simulation. This example has pretty much what I'm interested into in my model: higher-order interactions, pairwise and beyond, but I can't rely on the data collection to give me the context of the functions I need (basically the ids of the agents involved in the calculation) unless I hack my way into the data collector or extract all microscopic information from the model and calculate the order parameters from the DataFrames.

I'll address the other questions in the next comment

Libbum commented 3 years ago

A lot to unpack here, so will take some time to digest. At least some of your issues could be solved by using the by_id scheduler though, I think?

Libbum commented 3 years ago

Is there a helper function Phi I'm missing for this, or is there some additional package I need?

flipgthb commented 3 years ago

@Libbum:

I'm afraid I don't quite follow the specifics of what you've described however. Would it be possible for you to share a quick example that demonstrates how the current API causes these inefficiencies (if I could run this example as a test-bed, that would be great), and then how your methods discussed above might change that?

I hope the example above helps. Regarding how the methods I proposed can help, consider the "lift" of the order parameters functions I did

function model_data_experiment_1()
    model_overlap(model) = [overlap(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_inhibition(model) = [inhibition(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_excitation(model) = [excitation(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_coherence(model) = [coherence(a_i,a_j) for a_i in allagents(model), a_j in allagents(model)]
    model_m_frustration(model) = [m_frustration(a_i,a_j,a_k) 
                                  for a_i in allagents(model), a_j in allagents(model), a_k in allagents(model)
                                  if (a_i.id != a_j.id && a_j.id != a_k.id && a_i.id != a_k.id)]
    model_w_frustration(model) = [w_frustration(a_i,a_j,a_k) 
                                  for a_i in allagents(model), a_j in allagents(model), a_k in allagents(model)
                                  if (a_i.id != a_j.id && a_j.id != a_k.id && a_i.id != a_k.id)]
    model_balance(model) = [balance(a_i,a_j,a_k) 
                                  for a_i in allagents(model), a_j in allagents(model), a_k in allagents(model)
                                  if (a_i.id != a_j.id && a_j.id != a_k.id && a_i.id != a_k.id)]

    return [model_overlap,model_inhibition,model_excitation,model_coherence,model_m_frustration,model_w_frustration,model_balance]
end

Despite how dumb this lifts are, they show that I have to repeat a lot of code to achieve them, so this is why I thought some kind of high level selector (like Neightbors) could help to avoid that. Granted, I could have achieved the same with something like

function lift_pairwise_to_model(fs)
    mfs = map(fs) do f
        mfname = Symbol("model_",f)
        @eval ($mfname)(model) = [$f(a1,a2) for a1 in allagents(model), a2 in allagents(model)]
    end
    return mfs
end

mdata = lift_pairwise_to_model([overlap,inhibition])

which mitigates the code repetition...

A lot to unpack here, so will take some time to digest. At least some of your issues could be solved by using the by_id scheduler though, I think?

The by_id scheduler makes the agents activate in order, which is not what I want.

@Datseris:

I just need to say that this does not appear intuitive to me. Not only there is a very deep nesting level in the commands, in striking contrast to the current interface that has a nesting level of at most 1 (as adata can only be a Vector{Tuple}), but it is not clear to me what many of the extra keywords should do. I don't think having such clunky interfaces is a step forwrads, after spending several months simplifying Agents.jl as much as possible.

Actually, I agree with both of you that what I proposed is too complex and possibly not worth the work. But...

I don't see how this makes sense. What I'd think intuitively that the columns of such a table would be the model-level properties. Problem is, what are the rows? There are no rows, because an ABM exists only at a given instant in time, and has no history (unless explicitly created by the user as a model level parameter).

Thats the point! There is no isomorphism between Agent Model and Tables, however the Agents.jl data collection interface provides two "surjections" from an agent model onto tables, one at the bottom (the microscopic state of each agent or functions of each agent's properties) and one at the top (model properties or functions of model properties, including the agents). Any other mesoscale must be projected into one of these surjections, and that is why I feel like fighting the interface for some of the functions I need

What you present here is a very interesting model and would certainly make an interesting addition in our examples library. What is not yet clear to me is whether you are really interested in similarity(agent1,agent2), or the average of similarity(agent1,agent2) over all possible pairs of agent1, agent2. It makes a huge difference. If it is the latter, then I don't see why it is not possible to do it with the low-level data collection. If it is the former, then I don't see a way of avoiding allocating a huge matrix.

I am interested in the matrices (and/or tensors of higher rank), since the sample moments are not enough to describe the macroscopic dynamics. I usually do kernel density estimation, some clustering and other methods that require more information.

So to summarize the discussion on my on words, and correct me if I got it wrong. The gist is that we'd like to have another data aggregation possibility, that instead of aggregating over 1 agent, aggregates over 2 agents as input arguments to a given user function. This of course raises the question on whether the pairs used should be all-to-all, or some kind of nearest neighbors using nearby_agents. Probably the second approach is better, as it can also achieve all-to-all by passing Inf as the interaction radius.

I think you, more or less, nailed it. It is not a matter of syntax, but how to collect data at multiple scales without "hacking" my pairwise or triple-wise functions into model data. I think that, maybe, it would be enough to add context to data collected, somewhat like I did above with the agents ids, but I don't know.

flipgthb commented 3 years ago

Is there a helper function Phi I'm missing for this, or is there some additional package I need?

Sorry, I forgot to paste it from the notebook. I edited the comment, but here is the definition:

using Distributions
Phi(x) = cdf(Normal(0,1),x)

Libbum commented 3 years ago

Thanks. I'll give this some thought and come back with comments once I've got some suggestions.

Libbum commented 3 years ago

From what I can follow, it looks like the biggest omission in our API is the ability to collect data from agent interactions. Felippe has suggested two methods above, with some open questions left to solve.

Method one: some form of lifting interface, which stores matrices in the DataFrame (but the examples above have concerns about identification). Method two: Explicit pairings in the DataFrame, as shown by the mdf = vcat(run! ... solution.

I'm unsure if you have a preferred solution between these two Felippe?

Method one we can probably do very simply with no fundamental changes, just some additional helpers:

# Two exported methods             
pair_map(f, model) = (f(i, j) for (i, j) in pair_iter(model))  
pair_map(f, model, filter) = (f(i, j) for (i, j) in pair_iter(model) if filter((i, j)))
triple_map(f, model) = (f(i, j, k) for (i, j, k) in triple_iter(model))                
triple_map(f, model, filter) = 
    (f(i, j, k) for (i, j, k) in triple_iter(model) if filter((i, j, k)))
# Two internals
pair_iter(model) = ((model[i], model[j]) for i in by_id(model), j in by_id(model))
triple_iter(model) = (
    (model[i], model[j], model[k]) for
    i in by_id(model), j in by_id(model), k in by_id(model)    
)

This is all zero allocation, so it's pretty decent in performance. We use by_id, so the matrices are consistent over runs, and have two dispatched methods for each map function; allowing users to pass any filter function that accepts an iterable and returns Bool.

This simplifies the collection quite a lot:

function experiment_2(model)                                                                                                                                                    
    model_overlap(model) = collect(pair_map(overlap, model))                     
    model_inhibition(model) = collect(pair_map(inhibition, model))                   
    model_excitation(model) = collect(pair_map(excitation, model))                   
    model_coherence(model) = collect(pair_map(coherence, model))
    model_m_frustration(model) = collect(triple_map(m_frustration, model, allunique))
    model_w_frustration(model) = collect(triple_map(w_frustration, model, allunique))
    model_balance(model) = collect(triple_map(balance, model, allunique))

    return [
        model_overlap,   
        model_inhibition,
        model_excitation,
        model_coherence,    
        model_m_frustration,
        model_w_frustration,
        model_balance,
    ]
end
model = init_model(; K = 3, N = 5)
_, data = run!(model, nnagent_step!, model_step!, 2; mdata = experiment_2(model))

Method two—I'll need some more time on that. Let me know your thoughts on this one in the mean time.

Datseris commented 3 years ago

This seems good to me.

What I find a bit disconnected is how we are discussing about a situation that is clearly aggregation over agents, but we create model-level data collection for it.

flipgthb commented 3 years ago

@Libbum:

I'm unsure if you have a preferred solution between these two Felippe?

I prefer Method one, as it is simpler and suffices.

# Two exported methods             
pair_map(f, model) = (f(i, j) for (i, j) in pair_iter(model))  
pair_map(f, model, filter) = (f(i, j) for (i, j) in pair_iter(model) if filter((i, j)))
triple_map(f, model) = (f(i, j, k) for (i, j, k) in triple_iter(model))                
triple_map(f, model, filter) = 
    (f(i, j, k) for (i, j, k) in triple_iter(model) if filter((i, j, k)))
# Two internals
pair_iter(model) = ((model[i], model[j]) for i in by_id(model), j in by_id(model))
triple_iter(model) = (
    (model[i], model[j], model[k]) for
    i in by_id(model), j in by_id(model), k in by_id(model)    
)

I think this is quite close to what I initially proposed, but cleaner. My only concern is that it solves the problem up to triple interaction order, but it effectively helps with all the problems I have in my model.

@Datseris:

What I find a bit disconnected is how we are discussing about a situation that is clearly aggregation over agents, but we create model-level data collection for it.

I'm not sure how to achieve this with aggregation over agents. You mean I should be doing something like this?

run!(init_model(;K=3,N=5),nnagent_step!,model_step!,2; adata=[(:w, aggregating_version_of_overlap)])
# or
run!(init_model(;K=3,N=5),nnagent_step!,model_step!,2; adata=[(:w, overlap, someway_to_filter_pairs)])

Datseris commented 3 years ago

At the moment it is not possible. However, the current interface is that if you want to aggregate a property over agents, e.g. the mean of agent property weight, this is passed as an option to adata. Here we are doing a two-fold aggregation again over some agent property, but it is passed as model aggregation option.

Libbum commented 3 years ago

An omission from my post last night: experiment_2 would be good enough if you were not adding or removing agents over the run, but if you were, the best solution would be to just add by_id to the return vector so you had the correct indices for each matrix at each step.

What I find a bit disconnected is how we are discussing about a situation that is clearly aggregation over agents, but we create model-level data collection for it.

Yeah. I think with this sort of thing in its current form it would be a model property, since its 'more' than an agent property.

However, it would be possible to have methods like this and put them in the agent frame if we did something like Method 2.

Run the pair_map, but put it in the table like

id	matched	overlap	inhibition
1	2	54.75	6468.56

The complication comes when you need the triplet too

id	matched	overlap	balance
1	(2,)	54.75	missing
1	(2, 3)	missing	46.36

We know that including missings causes a notable performance loss, so this is not ideal. I don't think it should be a consideration.

My only concern is that it solves the problem up to triple interaction order, but it effectively helps with all the problems I have in my model.

Do you mean that you'd like the possibility of N interaction agents? That would be fine. I think I could write a macro that generated something similar to above, but would have syntax like map_n(function, model, filter, order), where order = 5 would give you the complete comparison of five agents. If not, this problem would become quite intractable once you got much higher than that due to time (and probably memory) constraints, so we could just manually add a few more methods and the map_n function could be a wrapper that called the specific implementations.

Datseris commented 3 years ago

We should be careful to not go too far into implementing something overly specific for a singular use case.

I think we can go with the "Method One" of Tim. However, the example defines model_overlap(model) = collect(pair_map(overlap, model)) which means that this outputs a vector. Probably makes more sense to use mean instead of collect?

Libbum commented 3 years ago

Probably makes more sense to use mean instead of collect?

No, I don't think so. Fellipe: I am interested in the matrices (and/or tensors of higher rank), since the sample moments are not enough to describe the macroscopic dynamics.

But that's neither here, nor there - since that's a portion of a user implementation. We would only provide the pair/triplet/higher order iterators and maps.

Datseris commented 3 years ago

@flipgthb did you try Tim's "Method One" in your model? Can you discuss here the output?

flipgthb commented 3 years ago

We know that including missings causes a notable performance loss, so this is not ideal. I don't think it should be a consideration

This is related to why I prefer Method one, as including higher order interactions in the data frame require a data frame for each n-fold interaction to avoid missing values (I think this is the same reasoning behind the split between agent data and model data, am I right?).

Do you mean that you'd like the possibility of N interaction agents? That would be fine. I think I could write a macro that generated something similar to above, but would have syntax like map_n(function, model, filter, order), where order = 5 would give you the complete comparison of five agents. If not, this problem would become quite intractable once you got much higher than that due to time (and probably memory) constraints, so we could just manually add a few more methods and the map_n function could be a wrapper that called the specific implementations.

Thats exactly what I mean.

@Datseris, yes, I tried and it works fine. The access to pair_map and triple_map makes it way simpler to define the functions for the order parameters in the same terms as they appear in the theory and, well, map them to the model.

Libbum commented 3 years ago

Cool. So in the end, what we're really after is:

A new iterator similar to allagents, but it works over pairs/triples/N collections, and a map method that allows us to apply methods. This has the benefit that we're not focused on data collection per-se, just improvements to the Agents API for when higher-order information processing is required.
A new section should be put in the documentation about the assets function to suggest this practice when you need to wrap complex data collection methods.
A new example, or extend a current one to showcase what we're discussing here.

I can get to work on that soon.

flipgthb commented 3 years ago

A new example, or extend a current one to showcase what we're discussing here.

Maybe I could help with that, though I am not allowed to publish my current model before we submit the paper to a journal (I am finishing with the figures)

Libbum commented 3 years ago

Most work is done for this issue in #417 and 7579d4856. Example requirements can be tracked in #422.