blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Is there a way to find out what steps odo will take? #586

Open ghost opened 7 years ago

ghost commented 7 years ago

I have a pandas.DataFrame and I want to send it to a remote SQL database. I'm not sure whether odo will do something fast, using \copy or INSERT ... VALUES, or something slow, using pandas.DataFrame.to_sql or SQLAlchemy's executemany.

Is there a way I can find out what it's doing? If it's doing something slow, is there a way to hint at something faster?

llllllllll commented 7 years ago

You can use convert.path(src, dst) to see the steps odo will take:

In [32]: from odo import convert

In [33]: convert.path(sa.Table, pd.DataFrame)
Out[33]: 
[(sqlalchemy.sql.schema.Table,
  collections.abc.Iterator,
  <function odo.backends.sql.sql_to_iterator>),
 (collections.abc.Iterator,
  odo.chunks.chunks(pandas.DataFrame),
  <function odo.convert.iterator_to_DataFrame_chunks>),
 (odo.chunks.chunks(pandas.DataFrame),
  pandas.core.frame.DataFrame,
  <function odo.convert.chunks_dataframe_to_dataframe>)]

Hopefully this helps!
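
For readers who want to see the idea behind that output, here is a minimal sketch of a cost-weighted shortest-path search over a conversion graph. It is a toy model and not odo's implementation: the traceback later in this thread shows odo delegating to networkx with weight='cost', whereas this sketch uses a hand-written Dijkstra search from the standard library. The node names are plain strings, and the costs and the `toy_path` helper are assumptions made up for illustration.

```python
import heapq
from itertools import count

# Toy conversion graph: {source: {target: (cost, converter_name)}}.
# Costs are invented; converter names mirror the output above for readability.
GRAPH = {
    "Table": {"Iterator": (5.0, "sql_to_iterator")},
    "Iterator": {"chunks(DataFrame)": (1.0, "iterator_to_DataFrame_chunks")},
    "chunks(DataFrame)": {"DataFrame": (1.0, "chunks_dataframe_to_dataframe")},
    "DataFrame": {},  # no outgoing edges in this toy graph
}


def toy_path(graph, source, target):
    """Return the cheapest list of (src, dst, converter) steps.

    Hypothetical helper: a plain Dijkstra search, not odo's code.
    """
    tie = count()  # tie-breaker so the heap never compares step lists
    queue = [(0.0, next(tie), source, [])]
    seen = set()
    while queue:
        cost, _, node, steps = heapq.heappop(queue)
        if node == target:
            return steps
        if node in seen:
            continue
        seen.add(node)
        for nxt, (edge_cost, func) in graph.get(node, {}).items():
            if nxt not in seen:
                new_steps = steps + [(node, nxt, func)]
                heapq.heappush(queue, (cost + edge_cost, next(tie), nxt, new_steps))
    raise LookupError("no conversion path from %s to %s" % (source, target))


if __name__ == "__main__":
    for step in toy_path(GRAPH, "Table", "DataFrame"):
        print(step)
```

Each returned step is a (source, target, converter) triple, the same shape as the list printed by convert.path. When no chain of edges connects the two nodes, the search runs out of candidates and raises an error instead of returning a path.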

ghost commented 7 years ago

@llllllllll Thanks for responding; convert.path looks like a helpful tool!

In this particular case, I'm trying to load my dataframe into the remote SQL database, so I would guess it's convert.path(pd.DataFrame, sa.Table). But that gives:

>>> import pandas as pd
>>> import sqlalchemy as sa
>>> from odo import convert
>>> convert.path(pd.DataFrame, sa.Table)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/home/user/Documents/project/venv/lib/python3.5/site-packages/networkx/algorithms/shortest_paths/weighted.py in dijkstra_path(G, source, target, weight)
     79     try:
---> 80         return path[target]
     81     except KeyError:

KeyError: <class 'sqlalchemy.sql.schema.Table'>

During handling of the above exception, another exception occurred:

NetworkXNoPath                            Traceback (most recent call last)
<ipython-input-31-3da53a9187b8> in <module>()
----> 1 convert.path(pd.DataFrame, sa.Table)

/home/user/Documents/project/venv/lib/python3.5/site-packages/odo/core.py in path(self, *args, **kwargs)
     39 
     40     def path(self, *args, **kwargs):
---> 41         return path(self.graph, *args, **kwargs)
     42 
     43     def __call__(self, *args, **kwargs):

/home/user/Documents/project/venv/lib/python3.5/site-packages/odo/core.py in path(graph, source, target, excluded_edges, ooc_types)
     90                                     if issubclass(n, oocs)])
     91     with without_edges(graph, excluded_edges) as g:
---> 92         pth = nx.shortest_path(g, source=source, target=target, weight='cost')
     93         result = [(src, tgt, graph.edge[src][tgt]['func'])
     94                   for src, tgt in zip(pth, pth[1:])]

/home/user/Documents/project/venv/lib/python3.5/site-packages/networkx/algorithms/shortest_paths/generic.py in shortest_path(G, source, target, weight)
    136                 paths=nx.bidirectional_shortest_path(G,source,target)
    137             else:
--> 138                 paths=nx.dijkstra_path(G,source,target,weight)
    139 
    140     return paths

/home/user/Documents/project/venv/lib/python3.5/site-packages/networkx/algorithms/shortest_paths/weighted.py in dijkstra_path(G, source, target, weight)
     81     except KeyError:
     82         raise nx.NetworkXNoPath(
---> 83             "node %s not reachable from %s" % (source, target))
     84 
     85 

NetworkXNoPath: node <class 'pandas.core.frame.DataFrame'> not reachable from <class 'sqlalchemy.sql.schema.Table'>

llllllllll commented 7 years ago

Loading a dataframe into a table uses append, which is not itself a network dispatcher. Instead, it is a regular multiply dispatched function that converts the input into an iterator and then appends the iterator to the table.

There is no "dry-run" for append but I agree that this would be a useful feature.
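
To make the distinction concrete, here is a rough sketch of a function dispatched on the types of both arguments, with a lookup step that reports which implementation would run without running it. Everything in it is hypothetical: `register`, `resolve`, `toy_append`, `ToyTable`, and `ToyFrame` are names invented for illustration, and the lookup rule is a simplification rather than how odo or its dispatch library resolve signatures.

```python
# Hypothetical registry: (target type, source type) -> implementation.
_REGISTRY = {}


def register(target_type, source_type):
    """Decorator that records an implementation for a pair of types."""
    def decorator(func):
        _REGISTRY[(target_type, source_type)] = func
        return func
    return decorator


def resolve(target, source):
    """Find the implementation for these arguments without calling it.

    Simplified rule: walk both MROs, most specific target class first.
    """
    for t_cls in type(target).__mro__:
        for s_cls in type(source).__mro__:
            func = _REGISTRY.get((t_cls, s_cls))
            if func is not None:
                return func
    raise TypeError("no implementation for (%s, %s)"
                    % (type(target).__name__, type(source).__name__))


def toy_append(target, source):
    """Dispatch on both argument types, then run the chosen implementation."""
    return resolve(target, source)(target, source)


class ToyTable:
    """Stand-in for a SQL table: just a list of row tuples."""
    def __init__(self):
        self.rows = []


class ToyFrame:
    """Stand-in for a dataframe: an iterable of records."""
    def __init__(self, records):
        self.records = records

    def __iter__(self):
        return iter(self.records)


@register(ToyTable, object)
def append_iterable_to_table(table, rows):
    # Generic fallback: consume any iterable one row at a time.
    for row in rows:
        table.rows.append(tuple(row))
    return table


@register(ToyTable, ToyFrame)
def append_frame_to_table(table, frame):
    # Convert the input into an iterator, then append the iterator.
    return toy_append(table, iter(frame))


if __name__ == "__main__":
    table = ToyTable()
    frame = ToyFrame([(1, "a"), (2, "b")])
    print(resolve(table, frame).__name__)  # inspect only, nothing is appended
    toy_append(table, frame)
    print(table.rows)
```

Calling `resolve` only names the first implementation that would be selected. It does not reveal the further calls that implementation makes, such as the hand-off to the iterator fallback here, which is part of why a real dry-run for append would need extra support.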

ghost commented 7 years ago

If I understand correctly, sending a local dataframe or CSV to a remote SQL database is supported or forthcoming, but it is not yet as fast as it could be. I get the impression that sending a dataframe sends one row at a time and that sending a CSV isn't possible. Is that right? I'd love to use odo for this if I can. Thanks for your help!