blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License
1k stars 138 forks source link

HDFStore -> SQL not supported #63

Closed jiffyclub closed 9 years ago

jiffyclub commented 9 years ago

There doesn't seem to be a way to get from an HDFStore to SQL, though you can go from an HDFStore to a Dataframe and a DataFrame to SQL.

This fails:

blz.into('sqlite:///bayarea.db::buildings', 'hdfstore://bayarea.h5::/buildings')

with this error:

NetworkXNoPath: node <class 'pandas.io.pytables.FrameFixed'> not reachable from <class '_abcoll.Iterator'>

But this works fine:

blz.into(pd.DataFrame, 'hdfstore://bayarea.h5::/buildings')

And this works fine:

buildings = store['buildings']
blz.into('sqlite:///bayarea.db::buildings', buildings)
mrocklin commented 9 years ago

Hrm, this works for me

In [1]: from into import into, discover, pd

In [2]: df = pd.DataFrame([['Alice',   100],
                   ['Bob',     200],
                   ['Charlie', 300]], columns=['name', 'balance'])

In [3]: into('hdfstore://foo.h5::accounts', df)
Out[3]: frame_table  (typ->appendable,nrows->3,ncols->2,indexers->[index])

In [4]: into('sqlite:///foo.db::accounts', 'hdfstore://foo.h5::accounts')
Out[4]: Table('accounts', MetaData(bind=Engine(sqlite:///foo.db)), Column('name', Text(), table=<accounts>), Column('balance', BigInteger(), table=<accounts>, nullable=False), schema=None)

Possibly it's running in to an error within the network, trying to reroute, and then running out of available paths.

Can you try

  1. Sharing the output of blz.discover(blz.resource('hdfstore://bayarea.h5::/buildings'))
  2. Raising errors blz.into('sqlite:///bayarea.db::buildings', 'hdfstore://bayarea.h5::/buildings', raise_on_errors=True)
jiffyclub commented 9 years ago
blz.discover(blz.resource('hdfstore://bayarea.h5::/buildings'))
dshape("""1936259 * {
  parcel_id: int32,
  building_type: ?string,
  residential_units: ?float32,
  non_residential_sqft: ?float32,
  building_sqft: ?float32,
  stories: ?float32,
  building_type_id: ?float32,
  year_built: ?float32,
  tenure: int32,
  id: int32,
  building: int32,
  scenario: int32,
  county: ?string,
  lot_size: ?float32,
  rent: ?float64,
  rental: bool,
  general_type: ?string,
  unit_sqft: ?float32,
  unit_lot_size: ?float32,
  x: ?float32,
  y: ?float32,
  _node_id0: int32,
  _node_id1: int32,
  _node_id2: int32,
  _node_id: int32
  }""")
jiffyclub commented 9 years ago
blz.into(
    'sqlite:///bayarea.db::buildings', 
    'hdfstore://bayarea.h5::/buildings', 
    raise_on_errors=True)

Gives the same error as without raise_on_errors:

---------------------------------------------------------------------------
NetworkXNoPath                            Traceback (most recent call last)
<ipython-input-11-3ca44359a7e0> in <module>()
      2     'sqlite:///bayarea.db::buildings',
      3     'hdfstore://bayarea.h5::/buildings',
----> 4     raise_on_errors=True)

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/into.pyc in into_string_string(a, b, **kwargs)
     78 def into_string_string(a, b, **kwargs):
     79     r = resource(b, **kwargs)
---> 80     return into(a, r, **kwargs)
     81 
     82 

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/into.pyc in into_string(uri, b, **kwargs)
     72 
     73     a = resource(uri, dshape=resource_ds, expected_dshape=ds, **kwargs)
---> 74     return into(a, b, dshape=ds, **kwargs)
     75 
     76 

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/into.pyc in into_object(a, b, **kwargs)
     58         if 'dshape' not in kwargs:
     59             kwargs['dshape'] = discover(b)
---> 60     return append(a, b, **kwargs)
     61 
     62 

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/backends/sql.pyc in append_anything_to_sql_Table(t, o, **kwargs)
    301 @append.register(sa.Table, object)
    302 def append_anything_to_sql_Table(t, o, **kwargs):
--> 303     return append(t, convert(Iterator, o, **kwargs), **kwargs)
    304 
    305 

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/core.pyc in __call__(self, *args, **kwargs)
     27 
     28     def __call__(self, *args, **kwargs):
---> 29         return _transform(self.graph, *args, **kwargs)
     30 
     31 

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/core.pyc in _transform(graph, target, source, excluded_edges, ooc_types, **kwargs)
     40     pth = path(graph, type(source), target,
     41                excluded_edges=excluded_edges,
---> 42                ooc_types=ooc_types)
     43     try:
     44         for (A, B, f) in pth:

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/into/core.pyc in path(graph, source, target, excluded_edges, ooc_types)
     77             graph = graph.subgraph([n for n in graph.nodes() if issubclass(n, oocs)])
     78     with without_edges(graph, excluded_edges) as g:
---> 79         pth = nx.shortest_path(g, source=source, target=target, weight='cost')
     80         result = [(source, target, graph.edge[source][target]['func'])
     81                     for source, target in zip(pth, pth[1:])]

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/networkx/algorithms/shortest_paths/generic.pyc in shortest_path(G, source, target, weight)
    136                 paths=nx.bidirectional_shortest_path(G,source,target)
    137             else:
--> 138                 paths=nx.dijkstra_path(G,source,target,weight)
    139 
    140     return paths

/Users/jiffyclub/miniconda3/envs/blaze/lib/python2.7/site-packages/networkx/algorithms/shortest_paths/weighted.pyc in dijkstra_path(G, source, target, weight)
     80     except KeyError:
     81         raise nx.NetworkXNoPath(
---> 82             "node %s not reachable from %s" % (source, target))
     83 
     84 

NetworkXNoPath: node <class 'pandas.io.pytables.FrameFixed'> not reachable from <class '_abcoll.Iterator'>
jiffyclub commented 9 years ago

One thing to note I guess is that the HDFStore I'm working with was not made by into, but using pandas.HDFStore.

mrocklin commented 9 years ago

All HDFStores are made with pandas. It's a pandas-only format. into just calls pandas to deal with it.

So far nothing seems amiss. Your datatypes are actually relatively tame. Any chance I could actually have a sample of the data? (totally fine if this isn't possible). Alternatively can you reproduce the error on a publicly accessible dataset?

mrocklin commented 9 years ago

BTW, if you want to be friendlier with other ecosystems you might want to try Panda's format= keyword.

df.to_hdf(path, key, format='table')

This is both appendable and a more widely recognized format.

jiffyclub commented 9 years ago

I can reproduce the error with something as simple as:

import pandas as pd
from into import into

df = pd.DataFrame({'a': range(5)})
with pd.HDFStore('test.h5') as store:
    store['test'] = df

into('sqlite:///:memory:::test', 'hdfstore://test.h5::/test')
mrocklin commented 9 years ago

Great. thanks!

On Tue, Feb 3, 2015 at 11:14 AM, Matt Davis notifications@github.com wrote:

I can reproduce the error with something as simple as:

import pandas as pdfrom into import into

df = pd.DataFrame({'a': range(5)})with pd.HDFStore('test.h5') as store: store['test'] = df

into('sqlite:///:memory:::test', 'hdfstore://test.h5::/test')

— Reply to this email directly or view it on GitHub https://github.com/ContinuumIO/into/issues/63#issuecomment-72714604.

mrocklin commented 9 years ago

OK, fixed in https://github.com/ContinuumIO/into/commit/69906d7e048c56611d97f1ae036279d20406d5fb

The issue was that the FrameFixed HDFStore format (what you get by default use of to_hdf) was marked as "possibly larger than memory" but only had one route out of it, namely through pd.DataFrame. When migrating between two possibly-larger-than-memory formats into restricts itself to the subgraph of only possibly-larger-than-memory nodes. In this subgraph, FrameFixed is isolated.

I've removed the marking that FrameFixed is possibly out-of-memory. This is de-facto true as it can only be created by an in-memory format and doesn't support partitioned/blocked access.

mrocklin commented 9 years ago
In [1]: import pandas as pd

In [2]: from into import into

In [3]: 

In [3]: df = pd.DataFrame({'a': range(5)})

In [4]: with pd.HDFStore('test.h5') as store:
   ...:         store['test'] = df
   ...:     

In [5]: into('sqlite:///:memory:::test', 'hdfstore://test.h5::/test')
Out[5]: Table('test', MetaData(bind=Engine(sqlite:///:memory:)), Column('a', BigInteger(), table=<test>, nullable=False), schema=None)

In [6]: into(list, _)
Out[6]: [(0,), (1,), (2,), (3,), (4,)]
jiffyclub commented 9 years ago

Confirmed, thanks!