Closed rohanaras closed 9 months ago
Hi @rohanaras, thanks for the detailed proposal. Can you identify what specific changes will need to be made? It's a bit tough to pick out individual changes from all the code you've provided. They key thing I'm trying to figure out is if this is proposal can be small, narrowly targeted, and have clear re-use potential for many other folks.
These changes might also require a new G.graph["simplified"] flag and simplification._is_simplified function, though I have not fully thought through how to reimplement these.
Note that the simplification._is_simplified
was removed from the codebase recently.
Yeah sure—what I'm proposing essentially boils down to the following changes:
simplification. _get_paths_to_simplify
would take a new endpoint_func
(defaulting to _is_endpoint
but that minimally takes a graph G
and node n
) and iterate through it instead of _is_endpoint
. This would also require a change to how the strict
argument is used, assuming that any custom endpoint_func
isn't required to handle it. I can think of two ways of handling this: the first respects the position of the strict
keyword but precludes its use in a custom endpoint_func
, the second doesn't respect the current position.
def _get_paths_to_simplify(G, strict=True, endpoint_func=_is_endpoint, **kwargs):
if endpoint_func == _is_endpoint:
kwargs['strict'] = strict
def _get_paths_to_simplify(G, endpoint_func=_is_endpoint, **kwargs):
if endpoint_func == _is_endpoint and 'strict' not in kwargs:
kwargs['strict'] = True
endpoints = set([n for n in G.nodes() if endpoint_func(G, n, **kwargs)])
simplification.simplify_graph
would also take a new endpoint_func
argument and pass this to _get_paths_to_simplify
. The above note about the strict
keyword applies here too.
def simplify_graph(G, endpoint_func=_is_endpoint, remove_rings=True, **kwargs):
if endpoint_func == _is_endpoint and 'strict' not in kwargs:
kwargs['strict'] = True
for path in _get_paths_to_simplify(G, endpoint_func=endpoint_func, **kwargs):
It might also make sense to add a field to the graph flagging what endpoint function was used.
Thanks @rohanaras. This is a bit niche for incorporating into the project so I'm going to close this for now. If this ends up being something that several other users also need, we can revisit it.
Hi @gboeing,
Thank you so much for this immensely useful library !
I believe that the feature proposed by @rohanaras is relevant and is close to what I wanted to submit on my side. I am working with others on bicycle networks using OSMnx, and we have observed that a lot of bicycle-related attributes can end abruptly. Because categorical attributes are merged when the network is simplified, it's hard to know where does the bicycle network start and end.
Visually, we need a function that can do this: It is simplifying nodes not giving any additional information while keeping the ones where the edges' colors change.
And while this is critical for bicycle attributes because of their sparsity, this can be true for most attributes. I heard other peers in NetSci/GIS needing the same feature for their own work, and for instance issue #846 shows in another way the broader need to keep some attributes or nodes.
We created some time ago a working (but now deprecated) custom set of functions by tuning the OSMnx simplification (in our custom fork used in this article) but I feel that this might be a useful feature for everyone.
An easy and general way to discriminate attributes while simplifying that I found is by adding another rule to the is_endpoint
function, as such:
# rule 5
if attributes is not None:
for attr in attributes:
if (
len(
{
e[-1][attr]
for e in list(G.out_edges(node, data=True))
+ list(G.in_edges(node, data=True))
}
)
> 1
):
return True
If the selected attributes are different between the edges connected to a node, then the node is an endpoint.
Then, by adding an additional optional argument attributes as def _is_endpoint(G, node, strict=True, attributes=None)
and to the upstream functions _get_paths_to_simplify
and simplify_graph
we can discriminate any relevant attribute by putting them as an input in a list. This is a basic working example on simple use cases that can go through the PR process, but it can be refined and I can do a follow-up if needed.
@csebastiao I'm open to considering this, as it sounds like several people may have found it useful.
Looking back at the original issue, @rohanaras had suggested accepting a user-defined _is_endpoint
function as an argument. That does create some burden for the user having to redefine the function. Part of my reasoning for closing it earlier is that if a user does have to custom define a function like this, it seems reasonable to just fork the repo for the custom niche use case.
Your proposal of adding a single argument attributes
to the _is_endpoint
function may offer something of compromise between niche use case complexity versus an easy-to-use API change for a somewhat common use case. That is, as both you and @rohanaras note, simplifying a bike network model without a feature like this may result in a topology that hinders downstream bike network analytics.
@csebastiao let me confirm first that I understand the issue fully. Current OSMnx "strict" simplification is purely topological: it merges adjacent edges based on incident node degrees (or self-looping). And current OSMnx "non-strict" simplification incorporates a little bit of attribute information (osmid values) when determining whether or not to merge adjacent edges. The problem is that given how bike infrastructure is digitized on OSM, additional attributes are necessary to create an endpoint between two edges denoting where the bike infrastructure begins/ends. Otherwise, you end up with a long merged edge with bike infrastructure ending somewhere in the middle of it. Does this understanding correctly represent both your and @rohanaras's issues?
Hi @gboeing, thanks for your fast reply !
That is correct. To give an example, if I extract the street network of Copenhagen (and Frederiksberg), I have without simplification 123241 nodes. With the strict simplification, I am left with 47758 nodes. With the non-strict simplification, I am left with 56804 nodes. With the simplification discriminating for an arbitrary attribute based on cycling OSM tags, I am left with 48275 nodes. As far as I have tested, the set of nodes of the last graph is a subset of the nodes that are in the simplified graph in non-strict mode, which make sense as edges on the same road have different osmids because they have different attributes, so the different cycling attributes are part of them.
The non-strict mode is doing the discrimination of edges/finding endpoints with the osmid attribute, which is not the relevant attribute for our use case. We want to be able to do the same thing but for arbitrary attributes, so some kind of generalization of the non-strict mode. That way we can only keep edges and nodes that are relevant to us based on what we want to study, whatever that is.
fully endorsed, that would be very useful. @martinfleis @jGaboardi @dabreegster @lukasballo @Robinlovelace @MicheleTirico tagging you here since we all have discussed this at some point in time... (see the graph plot by @csebastiao above for a quick grasp of what the idea here is - keeping interstititial nodes at graph simplification IF there is an edge attribute change)
@anastassiavybornova Thanks for the ping! Yes, I agree that this feature would be immensely useful for both the stuff we discussed earlier and some other projects I am involved with.
I'm reopening this issue. @csebastiao thanks for your further details. @anastassiavybornova @jGaboardi thanks for your input -- it sounds like this would be useful for many people.
Let's brainstorm what implementation might look like.
One idea I had was that we could generalize the strict
parameter itself. Currently it's a bool: if True, nodes are only retained if they are intersections or dead-ends; if False, nodes are also retained if their incident edges have different OSM IDs (docs). So this does a severely limited version of what you're talking about: it looks at an edge attribute, but only the osmid
(in the source code, this is "rule 4" of the _is_endpoint
algorithm). Maybe we could generalize the strict
parameter:
# current definition
strict: bool = True
# proposed redefinition
strict: str | None = None
The proposed redefinition would accept an edge attribute name as a string. If None
, it would maintain current strict=True
behavior. If "osmid"
, it would maintain current strict=False
behavior. And if any other edge attribute, it would execute "rule 4" on that attribute instead of on osmid
. This would be akin to the "rule 5" @csebastiao proposed above.
If a change like this to the parameter would resolve the issue, we could target the v2 branch. See #1106.
I like the idea to generalize rule 4 ! The only difference in behavior is that in the initial proposal I made one could input a list of edge attributes, in case we want to discriminate on multiple attributes. The way I wrote it could actually only handle list-like type as an input, but that can easily be avoided.
To see why it might make sense to discriminate on multiple attributes, let's take as an example a graph $G = (V, E)$ with: $\begin{equation} V = \{i \hspace{0.25em} |\hspace{0.25em} i \in [1, 6] \} \quad \mathrm{and} \quad E = \{ \{i, i+1\} \hspace{0.25em} | \hspace{0.25em} i \in V - \{6\} \} \end{equation}
$
This is a straight road of 6 nodes and 5 edges. We give to edges 3 different attributes $\{A, B, C\}
$. Attribute A is different on edge {1, 2}, meaning that every other edges have the same value for attribute A. Attribute B is different on edge {3, 4}, attribute C is different on edge {4, 5}. With non-strict simplification, we only keep node {1, 6}, as they are dead-ends. With simplification on A, we keep nodes {1, 2, 6}, since the value is different for the edge {1, 2}. On B, we keep nodes {1, 3, 4, 6}, since the value is different for the edge {3, 4}, so we need to have both nodes to discriminate the attributes' values. On C, we keep nodes {1, 4, 5, 6}. With strict simplification, as osmid is different for any segment where an attribute is different, we keep all nodes (more generally the union of the set of nodes for the simplified graphs with one attribute discriminated).
If one could simplify while discriminating on A and B for instance, we would keep nodes {1, 2, 3, 4, 6}. And any combination of 2 attributes will give a different set of nodes. Discriminating on all attributes is the same as discriminating using osmid. And a simple "merge" of some kind of two attributes is not the same as discriminating separately on both.
Until now I my use cases I usually give as input a single attribute for simplification, one created with the simplification in mind. But it might be useful to be more general, especially since it would not be more complex to the user to be able to give as an input either a string with the name of the attribute or a list of strings.
On a side note, one should be careful to not give attributes that will (almost) always be different, like "geometry" and "length". It might consume a lot of time doing computations for not simplifying a single node. I feel that some warning if a name known to have this behavior is given as an input might be nice.
Thanks @csebastiao. Yes your reasoning for accepting a list of edge attribute names makes sense.
I was previously thinking it would be simpler to just redefine the existing strict
parameter. But now it seems like that may be more confusing than anything else (especially since the name would be a bit of a misnomer). Probably best to deprecate the strict
parameter and replace it with a new attrs
parameter defined like:
# proposed parameter
attrs: list[str] | None = None
If attrs=None
, it would maintain current strict=True
behavior. If attrs=["osmid"]
, it would maintain current strict=False
behavior. And if attrs=
any other edge attribute(s), it would execute rule 4 on that/those attribute(s).
So then the new generalized rule 4 could look like:
# proposed new rule 4
if attrs is not None:
for attr in attrs:
in_values = {v for _, _, v in G.in_edges(node, data=attr, keys=False)}
out_values = {v for _, _, v in G.out_edges(node, data=attr, keys=False)}
if len(in_values | out_values) > 1:
return True
And a MWE to test it:
import osmnx as ox
def test_rule4(G, node, attrs):
if attrs is not None:
for attr in attrs:
in_values = {v for _, _, v in G.in_edges(node, data=attr, keys=False)}
out_values = {v for _, _, v in G.out_edges(node, data=attr, keys=False)}
if len(in_values | out_values) > 1:
return True
return False
# test the rule
G = ox.graph_from_place("Piedmont, CA, USA", network_type="drive", simplify=False)
print(test_rule4(G, node=53021742, attrs=None)) # False
print(test_rule4(G, node=53021742, attrs=["osmid"])) # True
print(test_rule4(G, node=53021742, attrs=["highway"])) # False
print(test_rule4(G, node=53021742, attrs=["name"])) # True
print(test_rule4(G, node=53021742, attrs=["name", "highway"])) # True
Comments or suggestions?
This new rule 4 is solving every issue I had @gboeing, I have nothing more to add !
Any final comments or feedback from anyone else here? @rohanaras @jGaboardi @anastassiavybornova ?
Comment from me coming at this with little context: how to get this working? Minimal reproducible example would help, happy to test.
Nothing from me. That new rule4
is pretty spiffy, and seems to be incredibly flexible for retaining all network attributes (if desired) during simplification.
@martinfleis Anything from you?
This is definitely a lot more elegant than what I had originally proposed! I believe it would work for my old use case.
My one concern is that attrs=["osmid"]
feels obscure compared to the existing strict=False
. But that could just be a status quo bias. And it's probably best addressed with documentation of the change.
Thinking about this a little bit more—is there perhaps a better name for attrs
that highlights the fact that it controls preventing/short-circuiting simplification?
A good enough name could be more easily interpretable than the current strict
keyword, honestly.
Another thing to consider is G.graph["simplified"]
. I imagine that it would make sense to replace this flag with the list of attrs that the graph simplified up to.
This check might also deserve a rethink—should a user be allowed to further simply a graph? (Assuming that there are nodes that could be further simplified under the current "strict" regime)
if "simplified" in G.graph and G.graph["simplified"]: # pragma: no cover
msg = "This graph has already been simplified, cannot simplify it again."
raise GraphSimplificationError(msg)
Thinking about this a little bit more—is there perhaps a better name for attrs that highlights the fact that it controls preventing/short-circuiting simplification?
Perhaps endpoint_attrs
would be both concise and specific?
endpoint_attrs
makes sense to me.
Proposed resolution in #1117.
Comment from me coming at this with little context: how to get this working? Minimal reproducible example would help, happy to test.
@Robinlovelace see https://github.com/gboeing/osmnx/pull/1117#issuecomment-1914197066
Another thing to consider is G.graph["simplified"]. I imagine that it would make sense to replace this flag with the list of attrs that the graph simplified up to.
This check might also deserve a rethink—should a user be allowed to further simply a graph? (Assuming that there are nodes that could be further simplified under the current "strict" regime)
@rohanaras the simplification algorithm only works once. It would have to be wholly rethought to allow a second simplification to occur on a simplified graph, and I'm not sure that would even be useful: especially with #1117, users should be able to have relatively fine control over how simplification proceeds. Hence the check is really just a guard against inscrutable errors during re-simplification, to let the user know in a more straightforward way that it can only happen once.
This enhancement has been released in v1.9.
Note that #1145 additionally adds a node_attrs_include
param to the simplify_graph
function to further flexibly relax graph simplification strictness. It also renames the endpoint_attrs
param to edge_attrs_differ
for consistent and clear naming, given the new param.
Is your feature proposal related to a problem?
I've been working on a side project with pedestrian/sidewalk networks recently (and some bike, but mostly ped) and have found myself in situations where it would be useful to simply the network down from the unsimplified OSM network but the existing simplification function is far too strict. In particular, I would like to be able to leave nodes in places where the infrastructure in an area switches from dedicated to non-dedicated. By doing this, nodes capture all of the places where someone walking along the graph has to make a decision—not only to go left or right at an intersection, but also the decision to turn around because (oh no) the sidewalk ends or the trail doesn't have a marked crosswalk across a busy road.
Describe the solution you'd like to propose
I would like to be able to be able to pass a custom
_is_endpoint
function tosimplification.simplify_graph
(andsimplification._get_paths_to_simplify
as a result) as a parameter. The existingstrict
argument could be left as is or appended to a kwargs dictionary if the chosen function is the default.These changes might also require a new
G.graph["simplified"]
flag andsimplification._is_simplified
function, though I have not fully thought through how to reimplement these. My initial impression is that this line could be modified by replacingTrue
with the function name and the second condition in this line could simply be removed.Describe alternatives you've considered
I do have the functionality I want as it is, but it's required me to fully duplicate the
simplify_graph
and_get_paths_to_simplify
functions in their near entirety. My current setup looks like this:This level of duplication seems obviously quite silly to me though, and I'd much rather not have to. And to be clear, I'm only proposing changes to the latter two functions—not the inclusion of my
custom_is_endpoint
function.Additional context As a quick and dirty demonstration of what this network structure allows me to do, I used the following code to plot the dedicated/undedicated pedestrian network for a Seattle area suburb:
Green indicates a sidewalk or marked crosswalk, grey indicates unmarked crosswalk or pedestrian legal street without sidewalks. Note numerous two way intersections where pedestrian infrastructure transitions.