Customizable Graph Simplification

rohanaras commented 3 years ago

Is your feature proposal related to a problem?

I've been working on a side project with pedestrian/sidewalk networks recently (and some bike, but mostly ped) and have found myself in situations where it would be useful to simply the network down from the unsimplified OSM network but the existing simplification function is far too strict. In particular, I would like to be able to leave nodes in places where the infrastructure in an area switches from dedicated to non-dedicated. By doing this, nodes capture all of the places where someone walking along the graph has to make a decision—not only to go left or right at an intersection, but also the decision to turn around because (oh no) the sidewalk ends or the trail doesn't have a marked crosswalk across a busy road.

Describe the solution you'd like to propose

I would like to be able to be able to pass a custom _is_endpoint function to simplification.simplify_graph (and simplification._get_paths_to_simplify as a result) as a parameter. The existing strict argument could be left as is or appended to a kwargs dictionary if the chosen function is the default.

These changes might also require a new G.graph["simplified"] flag and simplification._is_simplified function, though I have not fully thought through how to reimplement these. My initial impression is that this line could be modified by replacing True with the function name and the second condition in this line could simply be removed.

Describe alternatives you've considered

I do have the functionality I want as it is, but it's required me to fully duplicate the simplify_graph and _get_paths_to_simplify functions in their near entirety. My current setup looks like this:

def custom_is_endpoint(G, node):
    """
    Is node a true endpoint of an edge (considering dedicated ped infrastructure).
    Return True for same cases as osmnx.simplification._is_endpoint OR if
    node represents transition between dedicated and non-dedicated infrastructure.
    An end point is a node that either:
    1) is its own neighbor, ie, it self-loops.
    2) or, has no incoming edges or no outgoing edges, ie, all its incident edges point
    inward or all its incident edges point outward.
    3) or, it does not have exactly two neighbors and degree of 2 or 4.
    4) or, represents the transition between any of the following: sidewalks, marked or
    unmarked crosswalks, and streets without dedicated infrastructure.
    ----------
    G : networkx.MultiDiGraph
        input graph
    node : int
        the node to examine
    strict : bool
        if False, allow nodes to be end points even if they fail all other rules
        but have edges with different OSM IDs
    Returns
    -------
    bool
    """
    is_strict_endpoint = ox.simplification._is_endpoint(G, node, strict=True)

    if is_strict_endpoint:
        return True

    pedestrian_highway_types = set(["footway", "path", "cycleway", "steps"])
    marked_crossing_types = [
        "marked",
        "zebra",
        "pedestrian_signals",
        "traffic_signals",
        "traffic_lights",
    ]

    node_dedicated_adj = False
    node_marked_adj = False
    node_unmarked_adj = False
    node_undedicated_adj = False
    for neighboring_node_multi_edge in G[node].values():
        for edge_attributes in neighboring_node_multi_edge.values():
            undedicated_edge = (
                edge_attributes["highway"] not in pedestrian_highway_types
            )

            marked_edge = False
            if "crossing" in edge_attributes:
                marked_edge = edge_attributes["crossing"] in marked_crossing_types
                unmarked_edge = not marked_edge
                dedicated_edge = False
            else:
                dedicated_edge = not undedicated_edge
                marked_edge = False
                unmarked_edge = False

            node_dedicated_adj = node_dedicated_adj or dedicated_edge
            node_undedicated_adj = node_undedicated_adj or undedicated_edge
            node_marked_adj = node_marked_adj or marked_edge
            node_unmarked_adj = node_unmarked_adj or unmarked_edge

    is_transition_node = (
        node_dedicated_adj + node_marked_adj + node_unmarked_adj + node_undedicated_adj
    ) > 1

    return is_transition_node

def custom_get_paths_to_simplify(G):
    """
    Generate all the paths to be simplified between endpoint nodes.
    The path is ordered from the first endpoint, through the interstitial nodes,
    to the second endpoint.
    Parameters
    ----------
    G : networkx.MultiDiGraph
        input graph
    Yields
    ------
    path_to_simplify : list
    """
    # first identify all the nodes that are endpoints
    endpoints = set([n for n in G.nodes() if custom_is_endpoint(G, n)])
    ox.utils.log(f"Identified {len(endpoints)} edge endpoints")

    # for each endpoint node, look at each of its successor nodes
    for endpoint in endpoints:
        for successor in G.successors(endpoint):
            if successor not in endpoints:
                # if endpoint node's successor is not an endpoint, build a path
                # from the endpoint node, through the successor, and on to the
                # next endpoint node
                yield ox.simplification._build_path(G, endpoint, successor, endpoints)

def custom_simplify_graph(G):
    """
    Simplify a graph's topology by removing interstitial nodes.
    Simplify graph topology by removing all nodes that are not intersections,
    dead-ends, or infrastructure transitions. Create an edge directly between
    the end points that encapsulate them, but retain the geometry of the
    original edges, saved as a new `geometry` attribute on the new edge. Note
    that only simplified edges receive a `geometry` attribute. Some of the
    resulting consolidated edges may comprise multiple OSM ways, and if so,
    their multiple attribute values are stored as a list.
    Parameters
    ----------
    G : networkx.MultiDiGraph
        input graph
    Returns
    -------
    G : networkx.MultiDiGraph
        topologically simplified graph, with a new `geometry` attribute on
        each simplified edge
    """
    if ox.simplification._is_simplified(G):
        raise Exception(
            "This graph has already been simplified, cannot simplify it again."
        )

    ox.utils.log("Begin topologically simplifying the graph...")

    # make a copy to not mutate original graph object caller passed in
    G = G.copy()
    initial_node_count = len(G)
    initial_edge_count = len(G.edges)
    all_nodes_to_remove = []
    all_edges_to_add = []

    # generate each path that needs to be simplified
    for path in custom_get_paths_to_simplify(G):

        # add the interstitial edges we're removing to a list so we can retain
        # their spatial geometry
        edge_attributes = dict()
        for u, v in zip(path[:-1], path[1:]):

            # there should rarely be multiple edges between interstitial nodes
            # usually happens if OSM has duplicate ways digitized for just one
            # street... we will keep only one of the edges (see below)
            if not G.number_of_edges(u, v) == 1:
                ox.utils.log(
                    f"Found multiple edges between {u} and {v} when simplifying"
                )

            # get edge between these nodes: if multiple edges exist between
            # them (see above), we retain only one in the simplified graph
            edge = G.edges[u, v, 0]
            for key in edge:
                if key in edge_attributes:
                    # if this key already exists in the dict, append it to the
                    # value list
                    edge_attributes[key].append(edge[key])
                else:
                    # if this key doesn't already exist, set the value to a list
                    # containing the one value
                    edge_attributes[key] = [edge[key]]

        for key in edge_attributes:
            # don't touch the length attribute, we'll sum it at the end
            if len(set(edge_attributes[key])) == 1 and not key == "length":
                # if there's only 1 unique value in this attribute list,
                # consolidate it to the single value (the zero-th)
                edge_attributes[key] = edge_attributes[key][0]
            elif not key == "length":
                # otherwise, if there are multiple values, keep one of each value
                edge_attributes[key] = list(set(edge_attributes[key]))

        # construct the geometry and sum the lengths of the segments
        edge_attributes["geometry"] = LineString(
            [Point((G.nodes[node]["x"], G.nodes[node]["y"])) for node in path]
        )
        edge_attributes["length"] = sum(edge_attributes["length"])

        # add the nodes and edges to their lists for processing at the end
        all_nodes_to_remove.extend(path[1:-1])
        all_edges_to_add.append(
            {"origin": path[0], "destination": path[-1], "attr_dict": edge_attributes}
        )

    # for each edge to add in the list we assembled, create a new edge between
    # the origin and destination
    for edge in all_edges_to_add:
        G.add_edge(edge["origin"], edge["destination"], **edge["attr_dict"])

    # finally remove all the interstitial nodes between the new edges
    G.remove_nodes_from(set(all_nodes_to_remove))

    # mark graph as having been simplified
    G.graph["simplified"] = True

    msg = (
        f"Simplified graph: {initial_node_count} to {len(G)} nodes, "
        f"{initial_edge_count} to {len(G.edges)} edges"
    )
    ox.utils.log(msg)
    return G

This level of duplication seems obviously quite silly to me though, and I'd much rather not have to. And to be clear, I'm only proposing changes to the latter two functions—not the inclusion of my custom_is_endpoint function.

Additional context As a quick and dirty demonstration of what this network structure allows me to do, I used the following code to plot the dedicated/undedicated pedestrian network for a Seattle area suburb:

def check_tag_match(G, tag_values={}):
    for e in G.edges:
        for tag, values in tag_values.items():
            edge_data = G.edges[e]
            if tag in edge_data:
                e_tag_values = edge_data[tag]

                if type(e_tag_values) is list:
                    value_match = set(e_tag_values) & set(values)
                else:
                    value_match = e_tag_values in values

                if not value_match:
                    break

        yield value_match

dedicated_types = set(["footway", "steps", "path", "cycleway"])
marked_crossing_types = [
    "marked",
    "zebra",
    "pedestrian_signals",
    "traffic_signals",
    "traffic_lights",
]

dedicated_infrastructure = check_tag_match(
    se_bellevue_cs_G, {"highway": dedicated_types, "crossing": marked_crossing_types}
)

colors = ["tab:green" if match else "#999" for match in dedicated_infrastructure]
ns = [(se_bellevue_cs_G.in_degree[node] == 2) * 10 for node in se_bellevue_cs_G.nodes]
ox.plot_graph(se_bellevue_cs_G, node_size=ns, edge_color=colors, figsize=(20, 20))
plt.show()

Green indicates a sidewalk or marked crosswalk, grey indicates unmarked crosswalk or pedestrian legal street without sidewalks. Note numerous two way intersections where pedestrian infrastructure transitions.

gboeing commented 3 years ago

Hi @rohanaras, thanks for the detailed proposal. Can you identify what specific changes will need to be made? It's a bit tough to pick out individual changes from all the code you've provided. They key thing I'm trying to figure out is if this is proposal can be small, narrowly targeted, and have clear re-use potential for many other folks.

These changes might also require a new G.graph["simplified"] flag and simplification._is_simplified function, though I have not fully thought through how to reimplement these.

Note that the simplification._is_simplified was removed from the codebase recently.

rohanaras commented 3 years ago

Yeah sure—what I'm proposing essentially boils down to the following changes:

simplification. _get_paths_to_simplify would take a new endpoint_func (defaulting to _is_endpoint but that minimally takes a graph G and node n) and iterate through it instead of _is_endpoint. This would also require a change to how the strict argument is used, assuming that any custom endpoint_func isn't required to handle it. I can think of two ways of handling this: the first respects the position of the strict keyword but precludes its use in a custom endpoint_func, the second doesn't respect the current position.
```
def _get_paths_to_simplify(G, strict=True, endpoint_func=_is_endpoint, **kwargs):
if endpoint_func == _is_endpoint:
    kwargs['strict'] = strict
```
```
def _get_paths_to_simplify(G, endpoint_func=_is_endpoint, **kwargs):
if endpoint_func == _is_endpoint and 'strict' not in kwargs:
    kwargs['strict'] = True
```
```
endpoints = set([n for n in G.nodes() if endpoint_func(G, n, **kwargs)])
```

simplification.simplify_graph would also take a new endpoint_func argument and pass this to _get_paths_to_simplify. The above note about the strict keyword applies here too.

def simplify_graph(G, endpoint_func=_is_endpoint, remove_rings=True, **kwargs):
if endpoint_func == _is_endpoint and 'strict' not in kwargs:
    kwargs['strict'] = True

for path in _get_paths_to_simplify(G, endpoint_func=endpoint_func, **kwargs):

It might also make sense to add a field to the graph flagging what endpoint function was used.

gboeing commented 3 years ago

Thanks @rohanaras. This is a bit niche for incorporating into the project so I'm going to close this for now. If this ends up being something that several other users also need, we can revisit it.

csebastiao commented 9 months ago

Hi @gboeing,

Thank you so much for this immensely useful library !

I believe that the feature proposed by @rohanaras is relevant and is close to what I wanted to submit on my side. I am working with others on bicycle networks using OSMnx, and we have observed that a lot of bicycle-related attributes can end abruptly. Because categorical attributes are merged when the network is simplified, it's hard to know where does the bicycle network start and end.

Visually, we need a function that can do this: Before After It is simplifying nodes not giving any additional information while keeping the ones where the edges' colors change.

And while this is critical for bicycle attributes because of their sparsity, this can be true for most attributes. I heard other peers in NetSci/GIS needing the same feature for their own work, and for instance issue #846 shows in another way the broader need to keep some attributes or nodes.

We created some time ago a working (but now deprecated) custom set of functions by tuning the OSMnx simplification (in our custom fork used in this article) but I feel that this might be a useful feature for everyone.

An easy and general way to discriminate attributes while simplifying that I found is by adding another rule to the is_endpoint function, as such:

# rule 5
if attributes is not None:
    for attr in attributes:
        if (
            len(
                {
                    e[-1][attr]
                    for e in list(G.out_edges(node, data=True))
                    + list(G.in_edges(node, data=True))
                }
            )
            > 1
        ):
            return True

If the selected attributes are different between the edges connected to a node, then the node is an endpoint.

Then, by adding an additional optional argument attributes as def _is_endpoint(G, node, strict=True, attributes=None) and to the upstream functions _get_paths_to_simplify and simplify_graph we can discriminate any relevant attribute by putting them as an input in a list. This is a basic working example on simple use cases that can go through the PR process, but it can be refined and I can do a follow-up if needed.

gboeing commented 9 months ago

@csebastiao I'm open to considering this, as it sounds like several people may have found it useful.

Looking back at the original issue, @rohanaras had suggested accepting a user-defined _is_endpoint function as an argument. That does create some burden for the user having to redefine the function. Part of my reasoning for closing it earlier is that if a user does have to custom define a function like this, it seems reasonable to just fork the repo for the custom niche use case.

Your proposal of adding a single argument attributes to the _is_endpoint function may offer something of compromise between niche use case complexity versus an easy-to-use API change for a somewhat common use case. That is, as both you and @rohanaras note, simplifying a bike network model without a feature like this may result in a topology that hinders downstream bike network analytics.

@csebastiao let me confirm first that I understand the issue fully. Current OSMnx "strict" simplification is purely topological: it merges adjacent edges based on incident node degrees (or self-looping). And current OSMnx "non-strict" simplification incorporates a little bit of attribute information (osmid values) when determining whether or not to merge adjacent edges. The problem is that given how bike infrastructure is digitized on OSM, additional attributes are necessary to create an endpoint between two edges denoting where the bike infrastructure begins/ends. Otherwise, you end up with a long merged edge with bike infrastructure ending somewhere in the middle of it. Does this understanding correctly represent both your and @rohanaras's issues?

csebastiao commented 9 months ago

Hi @gboeing, thanks for your fast reply !

That is correct. To give an example, if I extract the street network of Copenhagen (and Frederiksberg), I have without simplification 123241 nodes. With the strict simplification, I am left with 47758 nodes. With the non-strict simplification, I am left with 56804 nodes. With the simplification discriminating for an arbitrary attribute based on cycling OSM tags, I am left with 48275 nodes. As far as I have tested, the set of nodes of the last graph is a subset of the nodes that are in the simplified graph in non-strict mode, which make sense as edges on the same road have different osmids because they have different attributes, so the different cycling attributes are part of them.

The non-strict mode is doing the discrimination of edges/finding endpoints with the osmid attribute, which is not the relevant attribute for our use case. We want to be able to do the same thing but for arbitrary attributes, so some kind of generalization of the non-strict mode. That way we can only keep edges and nodes that are relevant to us based on what we want to study, whatever that is.

anastassiavybornova commented 9 months ago

fully endorsed, that would be very useful. @martinfleis @jGaboardi @dabreegster @lukasballo @Robinlovelace @MicheleTirico tagging you here since we all have discussed this at some point in time... (see the graph plot by @csebastiao above for a quick grasp of what the idea here is - keeping interstititial nodes at graph simplification IF there is an edge attribute change)

jGaboardi commented 9 months ago

@anastassiavybornova Thanks for the ping! Yes, I agree that this feature would be immensely useful for both the stuff we discussed earlier and some other projects I am involved with.

gboeing commented 9 months ago

I'm reopening this issue. @csebastiao thanks for your further details. @anastassiavybornova @jGaboardi thanks for your input -- it sounds like this would be useful for many people.

Let's brainstorm what implementation might look like.

One idea I had was that we could generalize the strict parameter itself. Currently it's a bool: if True, nodes are only retained if they are intersections or dead-ends; if False, nodes are also retained if their incident edges have different OSM IDs (docs). So this does a severely limited version of what you're talking about: it looks at an edge attribute, but only the osmid (in the source code, this is "rule 4" of the _is_endpoint algorithm). Maybe we could generalize the strict parameter:

# current definition
strict: bool = True

# proposed redefinition
strict: str | None = None

The proposed redefinition would accept an edge attribute name as a string. If None, it would maintain current strict=True behavior. If "osmid", it would maintain current strict=False behavior. And if any other edge attribute, it would execute "rule 4" on that attribute instead of on osmid. This would be akin to the "rule 5" @csebastiao proposed above.

If a change like this to the parameter would resolve the issue, we could target the v2 branch. See #1106.

csebastiao commented 9 months ago

I like the idea to generalize rule 4 ! The only difference in behavior is that in the initial proposal I made one could input a list of edge attributes, in case we want to discriminate on multiple attributes. The way I wrote it could actually only handle list-like type as an input, but that can easily be avoided.

To see why it might make sense to discriminate on multiple attributes, let's take as an example a graph $G = (V, E)$ with: $\begin{equation} V = \{i \hspace{0.25em} |\hspace{0.25em} i \in [1, 6] \} \quad \mathrm{and} \quad E = \{ \{i, i+1\} \hspace{0.25em} | \hspace{0.25em} i \in V - \{6\} \} \end{equation}$

This is a straight road of 6 nodes and 5 edges. We give to edges 3 different attributes $\{A, B, C\}$. Attribute A is different on edge {1, 2}, meaning that every other edges have the same value for attribute A. Attribute B is different on edge {3, 4}, attribute C is different on edge {4, 5}. With non-strict simplification, we only keep node {1, 6}, as they are dead-ends. With simplification on A, we keep nodes {1, 2, 6}, since the value is different for the edge {1, 2}. On B, we keep nodes {1, 3, 4, 6}, since the value is different for the edge {3, 4}, so we need to have both nodes to discriminate the attributes' values. On C, we keep nodes {1, 4, 5, 6}. With strict simplification, as osmid is different for any segment where an attribute is different, we keep all nodes (more generally the union of the set of nodes for the simplified graphs with one attribute discriminated).

If one could simplify while discriminating on A and B for instance, we would keep nodes {1, 2, 3, 4, 6}. And any combination of 2 attributes will give a different set of nodes. Discriminating on all attributes is the same as discriminating using osmid. And a simple "merge" of some kind of two attributes is not the same as discriminating separately on both.

Until now I my use cases I usually give as input a single attribute for simplification, one created with the simplification in mind. But it might be useful to be more general, especially since it would not be more complex to the user to be able to give as an input either a string with the name of the attribute or a list of strings.

On a side note, one should be careful to not give attributes that will (almost) always be different, like "geometry" and "length". It might consume a lot of time doing computations for not simplifying a single node. I feel that some warning if a name known to have this behavior is given as an input might be nice.

gboeing commented 9 months ago

Thanks @csebastiao. Yes your reasoning for accepting a list of edge attribute names makes sense.

I was previously thinking it would be simpler to just redefine the existing strict parameter. But now it seems like that may be more confusing than anything else (especially since the name would be a bit of a misnomer). Probably best to deprecate the strict parameter and replace it with a new attrs parameter defined like:

# proposed parameter
attrs: list[str] | None = None

If attrs=None, it would maintain current strict=True behavior. If attrs=["osmid"], it would maintain current strict=False behavior. And if attrs= any other edge attribute(s), it would execute rule 4 on that/those attribute(s).

So then the new generalized rule 4 could look like:

# proposed new rule 4
if attrs is not None:
    for attr in attrs:
        in_values = {v for _, _, v in G.in_edges(node, data=attr, keys=False)}
        out_values = {v for _, _, v in G.out_edges(node, data=attr, keys=False)}
        if len(in_values | out_values) > 1:
            return True

And a MWE to test it:

import osmnx as ox

def test_rule4(G, node, attrs):
    if attrs is not None:
        for attr in attrs:
            in_values = {v for _, _, v in G.in_edges(node, data=attr, keys=False)}
            out_values = {v for _, _, v in G.out_edges(node, data=attr, keys=False)}
            if len(in_values | out_values) > 1:
                return True
    return False

# test the rule
G = ox.graph_from_place("Piedmont, CA, USA", network_type="drive", simplify=False)
print(test_rule4(G, node=53021742, attrs=None))  # False
print(test_rule4(G, node=53021742, attrs=["osmid"]))  # True
print(test_rule4(G, node=53021742, attrs=["highway"]))  # False
print(test_rule4(G, node=53021742, attrs=["name"]))  # True
print(test_rule4(G, node=53021742, attrs=["name", "highway"]))  # True

Comments or suggestions?

csebastiao commented 9 months ago

This new rule 4 is solving every issue I had @gboeing, I have nothing more to add !

gboeing commented 9 months ago

Any final comments or feedback from anyone else here? @rohanaras @jGaboardi @anastassiavybornova ?

Robinlovelace commented 9 months ago

Comment from me coming at this with little context: how to get this working? Minimal reproducible example would help, happy to test.

jGaboardi commented 9 months ago

Nothing from me. That new rule4 is pretty spiffy, and seems to be incredibly flexible for retaining all network attributes (if desired) during simplification.

jGaboardi commented 9 months ago

@martinfleis Anything from you?

rohanaras commented 9 months ago

This is definitely a lot more elegant than what I had originally proposed! I believe it would work for my old use case.

My one concern is that attrs=["osmid"] feels obscure compared to the existing strict=False. But that could just be a status quo bias. And it's probably best addressed with documentation of the change.

rohanaras commented 9 months ago

Thinking about this a little bit more—is there perhaps a better name for attrs that highlights the fact that it controls preventing/short-circuiting simplification?

A good enough name could be more easily interpretable than the current strict keyword, honestly.

Another thing to consider is G.graph["simplified"]. I imagine that it would make sense to replace this flag with the list of attrs that the graph simplified up to.

This check might also deserve a rethink—should a user be allowed to further simply a graph? (Assuming that there are nodes that could be further simplified under the current "strict" regime)

if "simplified" in G.graph and G.graph["simplified"]:  # pragma: no cover
    msg = "This graph has already been simplified, cannot simplify it again."
    raise GraphSimplificationError(msg)

gboeing commented 9 months ago

Thinking about this a little bit more—is there perhaps a better name for attrs that highlights the fact that it controls preventing/short-circuiting simplification?

Perhaps endpoint_attrs would be both concise and specific?

rohanaras commented 9 months ago

endpoint_attrs makes sense to me.

gboeing commented 9 months ago

Proposed resolution in #1117.

gboeing commented 9 months ago

Comment from me coming at this with little context: how to get this working? Minimal reproducible example would help, happy to test.

@Robinlovelace see https://github.com/gboeing/osmnx/pull/1117#issuecomment-1914197066

gboeing commented 9 months ago

Another thing to consider is G.graph["simplified"]. I imagine that it would make sense to replace this flag with the list of attrs that the graph simplified up to.

This check might also deserve a rethink—should a user be allowed to further simply a graph? (Assuming that there are nodes that could be further simplified under the current "strict" regime)

@rohanaras the simplification algorithm only works once. It would have to be wholly rethought to allow a second simplification to occur on a simplified graph, and I'm not sure that would even be useful: especially with #1117, users should be able to have relatively fine control over how simplification proceeds. Hence the check is really just a guard against inscrutable errors during re-simplification, to let the user know in a more straightforward way that it can only happen once.

gboeing commented 9 months ago

This enhancement has been released in v1.9.

gboeing commented 7 months ago

Note that #1145 additionally adds a node_attrs_include param to the simplify_graph function to further flexibly relax graph simplification strictness. It also renames the endpoint_attrs param to edge_attrs_differ for consistent and clear naming, given the new param.

gboeing / osmnx

Customizable Graph Simplification #625