UDST / pandana

Pandas Network Analysis by UrbanSim: fast accessibility metrics and shortest paths, using contraction hierarchies
http://udst.github.io/pandana

`ValueError: Buffer dtype mismatch` when constructing Network from pandas dataframe #88

Open double-u-a opened 7 years ago

double-u-a commented 7 years ago

Description of the bug

I cannot use pdna.Network() with my own pandas edge and node dataframes (built from an SQL query). The error reports a type mismatch: it is getting double where it expects long. All values in the dataframes are integers, so I can't pinpoint where this is coming from. The tutorial says the node_x, node_y and weight values should be float in any case.

This error doesn't come up when I am using osm.pdna_network_from_bbox() or I use your example osm_bayarea.h5 data and they even have float numbers, so I am assuming there is a specific way to construct the node and edge dataframes so they can be used by pdna.Network?

Network data (optional)

Our network is large and sits on a SQL database, so I'll just show the structure here. I've input an edge dataframe in this format (bigint from a pandas SQL query):

from to weight
1534152 1533645 839
1534051 1533659 1644
1534016 1534015 200
1534024 1534016 758
1534013 1534016 313

And the node data was in this format (bigint from a pandas sql query):

id x y
1539680 486522 240589
1539682 486522 240376
1539683 486531 240399
1539684 486540 240513
1539686 486563 240392

I also tried making sure the dtype of the data series matched exactly the osm_bayarea.h5 data but I also got the same error.

Edges

id
1840193    1534152
1840213    1534051
1855844    1534016
1855845    1534024
1855841    1534013
Name: from, dtype: int64
id
1840193    1533645
1840213    1533659
1855844    1534015
1855845    1534016
1855841    1534016
Name: to, dtype: int64
id
1840193     839.0
1840213    1644.0
1855844     200.0
1855845     758.0
1855841     313.0
Name: weight, dtype: float32

Nodes

id
1539680    486522.0
1539682    486522.0
1539683    486531.0
1539684    486540.0
1539686    486563.0
Name: x, dtype: float64
id
1539680    240589.0
1539682    240376.0
1539683    240399.0
1539684    240513.0
1539686    240392.0
Name: y, dtype: float64

The only significant difference is that the network is cropped from a larger graph we have, so the node ids don't start from 0 but from an arbitrary point, but I don't know if that affects this.

Thank you very much for your hard work on this package, it is very appreciated and I hope I can help.

Environment

Paste the code that reproduces the issue here:

net=pdna.Network(nodes["x"], 
                 nodes["y"],
                 edges["from"], 
                 edges["to"],
                 edges[["weight"]])

Paste the error message (if applicable):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-12c11d512036> in <module>()
      3                  edges["from"],
      4                  edges["to"],
----> 5                  edges[["weight"]])

~/anaconda3/envs/test-environment/lib/python3.5/site-packages/pandana/network.py in __init__(self, node_x, node_y, edge_from, edge_to, edge_weights, twoway)
     84                                                           .astype('double')
     85                                                           .as_matrix(),
---> 86                             twoway)
     87 
     88         self._twoway = twoway

ana/src/cyaccess.pyx in pandana.cyaccess.cyaccess.__cinit__ (src/cyaccess.cpp:2186)()

ValueError: Buffer dtype mismatch, expected 'long' but got 'double'

fscottfoti commented 7 years ago

The node ids have to be ints, so I'm guessing that for nodes["x"] and/or nodes["y"] the index (not the values) is of type double but should be of type long. Lemme know if that helps.

double-u-a commented 7 years ago

Thanks for the suggestion, I tested by including:

print(edges.index.dtype)
print(edges["from"].index.dtype)
print(edges["to"].index.dtype)
print(edges["weight"].index.dtype)
print(nodes.index.dtype)
print(nodes["x"].index.dtype)
print(nodes["y"].index.dtype)

And I got int64 printed for all of them. So my dataframe matches the osm_bayarea.h5 data for all index and column dtypes. However, the osm_bayarea.h5 data works fine with pdna.Network(), whereas my dataframe returns the dtype mismatch error.

fscottfoti commented 7 years ago

Hmm, from looking at the code, it's most likely with your edges. You might want to recreate this line of code with your data and see what the type of the resulting index is...

edges_df = pd.DataFrame({'from': edges["from"], 'to': edges["to"]}).join(edge_weights)

double-u-a commented 7 years ago

Okay I checked it like this:

edge_weights = edges["weight"]
edges_df = pd.DataFrame({'from': edges["from"], 'to': edges["to"]}).join(edge_weights)

print(edges_df.index.dtype)
print(edges_df['from'].index.dtype)
print(edges_df['to'].index.dtype)
print(edge_weights.index.dtype)

Returns int64 for all 4 indexes, this is the same for the osm_bayarea.h5 data too.

The only other thing I can say is different is that the x, y coordinates and weights are just integers turned into floats to fit the API docs (e.g. 486540.0), but I'm not sure if that relates.

fscottfoti commented 7 years ago

Not sure on this one. My guess is it's something fairly simple we're missing. Might need sample data and sample code to diagnose it...

double-u-a commented 7 years ago

Agreed, let me do some internal testing with different sample data from different sources (I've only tried this with the SQL-derived dataframe) and I'll get back to you either way. Thanks very much for your help!

lmnoel commented 6 years ago

Hello, I'm having the same issue and just found this thread. Was the problem ever resolved?

sablanchard commented 6 years ago

@double-u-a we wanted to check in on this to see if you had any updates: https://github.com/UDST/pandana/issues/88#issuecomment-318433914 It's been a while.

double-u-a commented 6 years ago

@sablanchard Yes I have been really meaning to get back to this, we've been busy completely rebuilding our geodatabase so I haven't had the opportunity to create the sample datasets for testing/reproduction of the issue. Fortunately the datasets should be ready in the next week or two. @lmnoel if you have some test data that reproduces the issue already then please do share in the meantime 👍

lmnoel commented 6 years ago

I'm trying to merge external data with a set of edges/nodes data frames returned from osm.network_from_bbox(), and from my testing, the mere act of concatenating a single row (with each column matching the dtype of the osm.network_from_bbox() DF's precisely) produces this error. @double-u-a @sablanchard

Edit: I think I have solved my issue. It turned out there was an issue with how I was constructing my DF to merge with the osm.network_from_bbox() DF, such that not every node in the edges DF was contained in the nodes DF. An explicit check/warning for this in the Net constructor might be helpful.
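
A check of the kind suggested here could be sketched with pandas alone. This is only an illustration, not part of pandana: `missing_edge_nodes` is a hypothetical helper, and the small `nodes` and `edges` frames are made-up sample data. It assumes the nodes dataframe is indexed by node id.

```python
import pandas as pd


def missing_edge_nodes(node_index, edge_from, edge_to):
    """Return the sorted edge endpoint ids that are absent from the node index.

    Hypothetical helper for illustration; pandana does not provide it.
    """
    endpoints = pd.Index(edge_from).union(pd.Index(edge_to))
    return sorted(endpoints.difference(pd.Index(node_index)).tolist())


# Made-up sample data: nodes are indexed by node id, and node 13 is missing.
nodes = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 2.0]},
                     index=[10, 11, 12])
edges = pd.DataFrame({"from": [10, 11, 12],
                      "to": [11, 12, 13],
                      "weight": [1.0, 2.0, 3.0]})

missing = missing_edge_nodes(nodes.index, edges["from"], edges["to"])
if missing:
    print("edges reference nodes that are not in the node index:", missing)
```

Running a check like this before calling pdna.Network() turns the opaque dtype error into a list of the offending node ids.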

lmnoel commented 6 years ago

I suggest something to the effect of the following line be added to the network constructor:


assert len((set([i[0] for i in edge_from.index] + [i[1] for i in edge_from.index])) - set(node_x.index)) <= 0, "Error: edges contain unspecified nodes"

double-u-a commented 6 years ago

Hello, many apologies for the delay in this. I've rebuilt my geodb with fresh data and can still replicate the error. As per @lmnoel, I've checked whether my edge and node sets match, and as far as I can tell the nodes and edges all match. The data is being retrieved from a pgsql db via pandas, and the data types match the example data and what osm.network_from_bbox() builds.

I can confirm running pandas.to_csv and then pandas.read_csv seems to make the dataframe work without error when running pdna.Network, so something in pandas sql derived dataframe is causing an error despite the correct dtypes. At this point it may well be a bug in pandas for all I know.

wa-bhe commented 5 years ago

Hello again!

This problem keeps coming up when I use the library, so I've worked on a self contained example that replicates the error.

Also note deprecation warnings in log at bottom.


import pandas as pd
import pandana as pdna

# declare graph as dictionary

edge_dict = {
 'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 'id_node_source': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4},
 'id_node_target': {0: 4, 1: 2, 2: 4, 3: 4, 4: 2},
 'distance': {0: 355.91725215477004,
  1: 339.0527044990422,
  2: 542.0301068103291,
  3: 405.7927520128794,
  4: 698.3406580590387}}

node_dict = {
 'id_node': {0: 1, 2: 2, 3: 3, 4: 4},
 'x': {0: 523991.2039019342,
  2: 524221.758848412,
  3: 523816.78407285974,
  4: 524193.69128971046},
 'y': {0: 2944562.7472850494,
  2: 2944811.345662121,
  3: 2944420.40466592,
  4: 2944270.042744304}}

# read dictionary into dataframe
edges_topo = pd.DataFrame.from_dict(edge_dict)
nodes_gdf = pd.DataFrame.from_dict(node_dict)

net = pdna.Network(node_x = nodes_gdf["x"],
                   node_y = nodes_gdf["y"],
                   edge_from = edges_topo["id_node_source"], 
                   edge_to = edges_topo["id_node_target"],
                   edge_weights = edges_topo[["distance"]])

C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py:82: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  nodes_df.astype('double').as_matrix(),
C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py:83: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  edges.as_matrix(),
C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py:85: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  .astype('double')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-828612a5b386> in <module>
      5                    edge_from = edges_topo["id_node_source"],
      6                    edge_to = edges_topo["id_node_target"],
----> 7                    edge_weights = edges_topo[["distance"]])

C:\Apps\Anaconda\envs\ium\lib\site-packages\pandana\network.py in __init__(self, node_x, node_y, edge_from, edge_to, edge_weights, twoway)
     85                                                           .astype('double')
     86                                                           .as_matrix(),
---> 87                             twoway)
     88 
     89         self._twoway = twoway

src\cyaccess.pyx in pandana.cyaccess.cyaccess.__cinit__()

ValueError: Buffer dtype mismatch, expected 'long' but got 'double'

semcogli commented 5 years ago

@wa-bhe, could you set the nodes_gdf index to "id_node" and then try it again? My understanding is that the nodes DF needs to be properly indexed to work.

wa-bhe commented 5 years ago

Adding an index as you suggested @semcogli creates a Network dataframe successfully. Many thanks!

nodes_gdf.set_index('id_node', inplace= True)

That does make sense, given that the function is expecting a graph created by osmnet.

I could make a PR on the docs to add a generic geodataframe loading section, specifying that the node layer needs to be indexed by the node id?
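
One plausible reading of why a wrong node index shows up as a dtype error rather than a lookup error: if the constructor translates node ids into positions with a left join (an assumption about pandana's internals, not confirmed in this thread), any edge endpoint absent from the node index becomes NaN, and pandas then stores the whole column as float64. The sketch below reproduces only that pandas behaviour, using the ids from the example above. `map_ids_to_positions` is a hypothetical helper, not a pandana function.

```python
import pandas as pd


def map_ids_to_positions(node_index, edge_ids):
    """Left-join edge endpoint ids onto positional node indexes.

    Hypothetical helper; it mimics an assumed id-to-position lookup.
    """
    node_idx = pd.Series(range(len(node_index)), index=node_index)
    merged = pd.merge(
        pd.DataFrame({"node_ids": list(edge_ids)}),
        pd.DataFrame({"node_idx": node_idx}),
        left_on="node_ids",
        right_index=True,
        how="left",
    )
    return merged["node_idx"]


edge_from = [1, 1, 2, 3, 4]  # id_node_source from the example above

# Index left at the dict keys (0, 2, 3, 4): node id 1 cannot be found.
unindexed = map_ids_to_positions(pd.Index([0, 2, 3, 4]), edge_from)

# Index set to the node ids (1, 2, 3, 4): every endpoint is found.
indexed = map_ids_to_positions(pd.Index([1, 2, 3, 4]), edge_from)

print(unindexed.dtype, indexed.dtype)
```

Under that assumption, the unmatched lookups turn the position column into floats, which would be consistent with the "expected 'long' but got 'double'" message and with set_index resolving it.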

nicholasmartino commented 3 years ago

I recently had the same problem and figured out it was because there were edges referencing non-existent nodes. I fixed it by filtering edges_gdf with this line of code:

edges_gdf = edges_gdf[edges_gdf['to'].isin(nodes_gdf['id_node']) & edges_gdf['from'].isin(nodes_gdf['id_node'])]

bstabler commented 3 years ago

nodes.set_index('ID', inplace= True) worked for me, thanks