Location and access for metadata

JuliaGraphs / GraphsBase.jl

Basic interface and structures for the JuliaGraphs ecosystem

http://juliagraphs.org/GraphsBase.jl/

MIT License

11 stars 1 forks source link

Location and access for metadata #14

Open gdalle opened 9 months ago

gdalle commented 9 months ago

Should it be in the vertex / edge object itself? What is the right syntax to access it? Can it be linked with the edge weight?

gdalle commented 9 months ago

mtfishman commented 9 months ago

I would vote for separating the metadata from the edges, and think of the edges as indices for accessing the metadata (and indicating where there is nontrivial/nonzero metadata).

Say we have a metagraph wrapping a simple graph like:

struct MetaGraph{T<:Integer,D} <: AbstractGraph{T,SimpleEdge{T}}
  graph::SimpleGraph{T}
  edge_data::MyComplicatedEdgeDataStorage{SimpleEdge{T},D}
end

It could require some non-trivial calculation to access edge data from the edge_data structure. In the interface proposed in #13 the edges would be defined like this (if I understand correctly):

struct DataEdge{T<:Integer,D} <: AbstractEdge{T,D}
    src::T
    dst::T
    data::D
end

function GraphsBase.out_edges(g::MetaGraph{T}, u) where {T}
    return DataEdge{T}(u, v, get_data(g.edge_data, SimpleEdge(u, v)) for v in out_neighbors(g, u))
end

This seems like a lot of unnecessary work to do each time you ask for the edges.

Also, it seems like another disadvantage is that there would be many edge types which different graphs would have to understand. If I had both a MetaGraph g defined above and also its parent SimpleGraph g.graph, it seems like they should have the same out_edges (say so I could pass edges accessed from the MetaGraph and have those easily used by the SimpleGraph). I know some of that could be handled by generic code and conversions, but it seems better in practice to have fewer "simple" edge types.

As for the weights, I think they can be thought of as just a special kind of metadata, which again I don't think should be stored with the edges, but instead accessed from the graph using an edge in some generic way (weight(g::AbstractGraph, e)). I think that allows a lot more flexibility in how the weight function can be customized, i.e. it could be customized through overloading on a graph type or even an instance of a graph by storing a weight function with the graph, as opposed to just overloading it based on the type of the edge (which seems to lead to an awkward situation that you need to define a new edge type to define a new weight function).

CameronBieganek commented 9 months ago

Here's my preferred interface for metadata.

First the user defines their own vertex and edge metadata types. These could be any objects with getproperty and setproperty! methods, but typically they would be struct, mutable struct, or named tuple.

mutable struct VertexData
    a::Int
    b::String
end

mutable struct EdgeData
    c::Float64
    d::String
end

Then the user can create a graph that uses those metadata types. Suppose we use Char for the vertex type. Then creating an empty graph would look like this:

g = Graph{Char, VertexData, EdgeData}()

After adding vertices, edges, and metadata, the metadata can be accessed as follows:

# Access vertex metadata:
g[u].a

# Access edge metadata:
g[e].c
g[u, v].d

And, if the underlying metadata types are mutable, then the metadata can also be set with the following syntax:

# Set vertex metadata:
g[u].a = 42

# Set edge metadata:
g[e].c = 3.14
g[u, v].d = "hello"

mtfishman commented 9 months ago

@CameronBieganek agreed, that is very similar to the design of DataGraphs.jl, based on a getindex and setindex! interface for accessing and modifying metadata.

One unfortunate thing with committing to g[u, v] as the syntax for accessing edge metadata is that it precludes nice syntax for cartesian indexing of vertices (if the vertices are some kind of cartesian index), similar to the Dict syntax:

julia> Dict((1, 2) => "x", (2, 1) => "y")[1, 2]
"x"

But that leaves open the question of how to access edge metadata, in DataGraphs I went with g[Edge(u, v)] or g[u => v] but I could see arguments against those, since it leaves ambiguity if vertices themselves happen to be Edge or Pair types (we just disallow that, but that might not be a great design for a general graph type). One design could be to have a universal edge indexing type g[EdgeIndex(u, v)] which would be disallowed as a vertex type or just having a function get_edge_data(g, u, v) but I could see arguments against both of those choices. Another option is that in the case of ambiguity, say if a vertex is a Pair or Edge, one has to access vertex data with Vertex indexing type g[Vertex(u => v)] or g[Vertex(Edge(u, v))]. Again I think that may not be ideal since then that could force people to need to use Vertex all over the place to make graph code generic to any graph types.

I think metadata can mostly be orthogonalized from the basic design and interface (though of course should be anticipated), since most graph algorithms don't require (or just ignore) metadata. The primary examples where metadata is needed that I came across is in operations where graphs are being merged so then you need to know how to merge metadata, but that would need some specialized functionality for different metagraph types anyway since you can't really anticipate how metadata will be stored (similar to how you can't really assume much about the storage structure of AbstractMatrix, though there could be a fallback standard metadata type comparable to Dict or Matrix in Base Julia). I think that could be solved by following the example of Julia Base AbstractArrays by just minimally requiring access to metadata through getindex, and then making use of generic concepts like similar and empty for making new metagraph structures.

mtfishman commented 9 months ago

I don't really see a particular need for requiring getproperty and setproperty! for metadata, I don't see why from the design of a generic graph interface you need to assume anything about what the metadata will be, just that you can access it and set it in the case of mutable metagraphs.

Regarding the issue around getindex/setindex! syntax, it may be better from an abstract type interface perspective (though not from an external interface perspective) to define lower level interface functions (EDIT: Simplified to just getdata, similar to the proposal in https://discourse.julialang.org/t/rfc-graphinterface-jl-an-interface-proposal-for-graphs-jl-version-2-0/104518/14 by @gdalle):

getdata(g, v) # Get vertex data
getdata(g, u, v) # Get edge data
setdata!(g, v, value) # Set vertex data if mutable
setdata!(g, u, v, value) # Set edge data if mutable

and let particular graph subtypes decide how getindex and setindex! act. There could be default fallbacks from getindex to getdata, which could then be overloaded if a graph instance wants getindex to mean something else or have slightly different syntax (analogous to getproperty and getfield in Base Julia).

CameronBieganek commented 9 months ago

I went with g[Edge(u, v)] or g[u => v] but I could see arguments against those, since it leaves ambiguity if vertices themselves happen to be Edge or Pair types

That's a very good point. I thought about that a while ago, and then I forgot about it. 😅

I don't really see a particular need for requiring getproperty and setproperty! for metadata, I don't see why from the design of a generic graph interface you need to assume anything about what the metadata will be, just that you can access it and set it in the case of mutable metagraphs.

You're right, the user should be able to use whatever object they want to store their metadata, including Dict, etc.

Here's a modified proposal:

vertex_data(g) and edge_data(g) (or some other spelling) return indexable objects that allow access to metadata.

Here's a demonstration of what it would look like in practice:

# User defined metadata types. User can use any type of object.
mutable struct VertexData
    a::Int
    b::String
end

mutable struct EdgeData
    c::Float64
    d::String
end

# ... (Create graph here.)

vd = vertex_data(g)
ed = edge_data(g)

vd['x'] = VertexData(1, "hello")
vd['y'] = VertexData(2, "world")
vd['x'].a = 100
vd['y'].b = "foo"

vd['x'].a  # 100
vd['x'].b  # "hello"

ed['x', 'y'] = EdgeData(1.2, "asdf")
ed[Edge('x', 'y')].d = "qwer"

ed['x', 'y'].c  # 1.2
ed['x', 'y'].d  # "qwer"

I just wish there was a more compact way to write it than vertex_data. If we shorten the function names to vdata and edata, then this doesn't look too bad:

vdata(g)['x'].a = 123

If one uses a dictionary to store their metadata instead of a struct or named tuple, that statement would look something like this:

vdata(g)['x'][:a] = 123

There are some oddball ways that we could do it, but they might be a little too odd:

g[]  # Returns an object that can be indexed to obtain vertex metadata.
g()  # Returns an object that can be indexed to obtain edge metadata.

# Examples:
g[]['x'].a = 123
g()['x', 'y'].a = 3.14

or this:

g(:vertex)  # Returns an object that can be indexed to obtain vertex metadata.
g(:edge)    # Returns an object that can be indexed to obtain edge metadata.

# Examples:
g(:vertex)['x'].a = 123
g(:edge)['x', 'y'].a = 3.14

I considered g['x'] and g('x', 'y') for vertex and edge metadata, respectively, but then you can't assign a whole new object as the metadata for a given edge.

mtfishman commented 9 months ago

That's close to the design I have in DataGraphs.jl, where the minimal requirement of the interface for accessing metadata is that a type implements vertex_data(g) and edge_data(g) which output dictionaries (or some other data structure, at least in principle) that can be indexed into with vertices and edges, so then getdata and setdata! can be defined through vertex_data(g) and edge_data(g):

getdata(g, v) = vertex_data(g)[v] # Get vertex data
getdata(g, u, v) = edge_data(g)[Edge(u, v)] # Get edge data
setdata!(g, v, value) = (vertex_data(g)[v] = value) # Set vertex data if mutable
setdata!(g, u, v, value) = (edge_data(g)[Edge(u, v)] = value) # Set edge data if mutable

I just use getindex and setindex! in DataGraphs.jl but I think it is too opinionated for a general graph interface for those to be part of the interface requirement and we should allow AbstractGraph subtypes to decide the meaning of getindex and setindex!.

In general, it may not be practical or efficient to implement vertex_data(g) and edge_data(g) for a given graph, since the vertex data and edge data may be tangled up in some complicated data structure, in which case getdata and setdata! could be overloaded (which means that code in Graphs.jl should only make use of getdata and setdata!).

One reason to lean towards just making getdata and setdata! the required interface is that it is easier to encode sparsity into the data structures. For example, in general undirected graphs will have symmetry between the data on the same edge in opposite directions. Often getdata(g, u, v) == getdata(g, v, u), however I have examples where getdata(g, u, v) == f(getdata(g, v, u)). Maybe one could argue that you should use a directed graph at that point, but the underlying graph is undirected for all purposes related to how it should be treated in graph algorithms, it's only the data that has some sense of directionality. Additionally, there may be other symmetry such that getdata(f, u, v) == getdata(f, w, x) which isn't easy to encode in a single data structure.

Of course you could have vertex_data(g) and edge_data(g) output a data structure which then encode these symmetries or relationships between elements through getindex and setindex!, but it seems unfortunate to require users to define data structures (which could get kind of complicated) as opposed to just defining function overloads of getdata and setdata!.

CameronBieganek commented 9 months ago

Of course you could have vertex_data(g) and edge_data(g) output a data structure which then encode these symmetries or relationships between elements through getindex and setindex!, but it seems unfortunate to require users to define data structures (which could get kind of complicated) as opposed to just defining function overloads of getdata and setdata!.

To clarify, by "users" do you mean implementers of new graph types, or users of existing graph types? Users of existing graph types shouldn't be overloading anything. And implementers of new graph types can, as you said, do anything they want with getindex(edge_data(g), ...) and set_index!(edge_data(g), ...). I think it's ok to have higher expectations of graph implementers than we do of graph users.

mtfishman commented 9 months ago

Yes, I meant implementers of new graph types. I was just arguing that getdata(g, ...) and setdata!(g, ...) may be simpler overloading interfaces than getindex(edge_data(g), ...) and setindex!(edge_data(g), ...) in practice, especially for more complicated graph types where it's not so straightforward to extract edge_data and vertex_data objects, since then new graph types have the burden of two nontrivial overloads: both edge_data/vertex_data and getindex defined for whatever type those output, instead of just one, getdata.

CameronBieganek commented 9 months ago

Of course it's partly a matter of taste. I prefer the square bracket syntax sugar for getting and setting values, rather than a function like setdata!(g, u, v, value).

especially for more complicated graph types where it's not so straightforward to extract edge_data and vertex_data objects, since then new graph types have the burden of two nontrivial overloads: both edge_data/vertex_data and getindex

I would imagine most graph implementers would use the following pattern:

struct VertexData{V}
    g::Graph{V}
end

struct EdgeData{V}
    g::Graph{V}
end

vertex_data(g::Graph) = VertexData(g)
edge_data(g::Graph) = EdgeData(g)

# Define getindex and setindex! overloads for VertexData and EdgeData.

So, vertex_data and edge_data just return wrappers around the graph to facilitate indexing.

Now, if we want to support multigraphs, then getdata would need the following methods:

getdata(g::AbstractGraph, v)
getdata(g::AbstractGraph, u, v)
getdata(g::AbstractGraph, e::AbstractEdge)

But as you pointed out above, this is flawed, because the generic graph interface allows for graphs where the vertex type is a subtype of AbstractEdge. So, you would need separately named functions, like this:

get_vertex_data(g, v)
set_vertex_data!(g, v, value)
get_edge_data(g, u, v)
get_edge_data(g, e)
set_edge_data!(g, u, v, value)
set_edge_data!(g, e, value)

But that starts to get a bit unwieldy.

mtfishman commented 9 months ago

I would just disallow getdata(g::AbstractGraph, e::AbstractEdge) (EDIT: By disallow, I mean define it as accessing vertex data, treating e as a vertex) to avoid ambiguity in the interface function, graph implementers can overload getindex however they want to including indexing with AbstractEdge to access edge data.

mtfishman commented 9 months ago

I think in general it is overly burdensome to require creating objects like VertexData and EdgeData as part of the interface, it won't always be natural to have edge and vertex data living in separate data structures like that, so then if they don't we are requiring graph types to then have both a graph type and then associated vertex and edge data types. We'd essentially be requiring general metagraphs to define 3 types instead of 1 type (the graph type) that overloads getdata.

mtfishman commented 9 months ago

Sorry, I didn't see that you were referring to multigraphs. Indeed, I hadn't thought about that case, that sounds like an argument for having separate get_vertex_data and get_edge_data functions. I guess also hypergraphs would be an argument for that as well, though that could be handled with getdata(g, v...). For multigraphs, how would edge data work? Is there edge data associated with each edge between two vertices?

I could imagine having getdata(g, u, v) that outputs a data structure for data associated with all of the edges connecting vertices u and v (for example a Dict that can be indexed by each edge connecting u and v), but maybe that's too simplistic.

CameronBieganek commented 9 months ago

For multigraphs, how would edge data work? Is there edge data associated with each edge between two vertices?

Yeah, presumably each edge connecting vertices u and v could have its own separate metadata. So, one would need the actual edges, e.g. Edge('a', 'b', 1) and Edge('a', 'b', 2), in order to lookup the edge metadata.

CameronBieganek commented 9 months ago

it is overly burdensome to require creating objects like VertexData and EdgeData

They're not explicitly required, though in practice something like EdgeData might be necessary if you need to provide two forms of indexing, ed[u, v] and ed[e], which a raw dictionary would not provide. You could, however, have vertex_data return a raw dictionary.

it won't always be natural to have edge and vertex data living in separate data structures like that

Perhaps I should have called them VertexDataAccessor and EdgeDataAccessor. In that example, they are very lightweight structs that only contain a reference to the graph. They do not contain the actual metadata. The actual metadata is contained somewhere in the graph g, stored in whatever fashion the graph implementer sees fit.

mtfishman commented 9 months ago

I understand they could be lightweight, but my primary point still stands that it (in many cases, most that I can think of) requires defining types (even if they are just simply wrapping the graph) that are beyond the original graph type instead of just overloading functions on the original graph type, which is extra burden for developers and I don't think is a great idea to have as part of the required interface (perhaps a standard to go by is: do you think that kind of design would make it into Base Julia?). Anyway, it's mostly splitting hairs and I think we're in general agreement about the design direction.

mtfishman commented 9 months ago

For multigraphs, it seems like the interface for get_edge_data would be closely tied to what the edges(g) iterator outputs, i.e. it could output a flat list of every edge (including multiple per vertex pair if there are multiedges), or a single edge per pair of vertices where those edges themselves store multiple edges, or possibly an edge multiplicity in the case of indistinguishable edges. Seems like get_edge_data needs to account for those various options, but the interface:

get_vertex_data(g, v)
set_vertex_data!(g, v, value)
get_edge_data(g, e)
set_edge_data!(g, e, value)

seems to cover that. Probably get_edge_data(g, u, v) doesn't make sense as part of a general interface since it's too closely tied to simple edges with a src and dst.

CameronBieganek commented 9 months ago

my primary point still stands that it (in many cases, most that I can think of) requires defining types (even if they are just simply wrapping the graph) that are beyond the original graph type instead of just overloading functions on the original graph type

If we remove the various foo(g, u, v) signatures from the interface, which we might have to do in order to fully support multigraphs, then vertex_data(g) and edge_data(g) could just return dictionaries. But I guess for your case where the metadata is somehow intertwined you would probably still need to define wrapper structs. Still, I think for most graph types with metadata the vertex and edge metadata is well separated and can easily be stored in a dictionary. So, it seems like you are arguing from the point of view of a very uncommon type of metadata graph. If one has unusual needs for their graph type, it is not that onerous that they might need to define a couple of simple wrapper structs.

mtfishman commented 9 months ago

I think we'll have to agree to disagree about this point for now, might have to just try them both and see how they work in the wild.