blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

odo"izable" object #317

Open femtotrader opened 8 years ago

femtotrader commented 8 years ago

Hello,

I'm looking for some help to make an object odo"izable" (able to be a source for odo).

>> import datapackage
>> import pandas as pd
>> from odo import odo
>> # Note trailing slash is important for data.okfn.org
>> datapkg = datapackage.DataPackage('http://data.okfn.org/data/cpi/')

>> odo(datapkg, pd.DataFrame)

raises

KeyError: <class 'datapackage.datapackage.DataPackage'>

see https://github.com/trickvi/datapackage/issues/45

Is it possible to inherit from a parent class to make an object odo"izable"?

Kind regards

llllllllll commented 8 years ago

You can extend odo's conversion graph by dispatching on convert, for example:

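import datapackage
import pandas as pd
from odo import convert

@convert.register(pd.DataFrame, datapackage.DataPackage)
def datapackage_to_dataframe(pkg, **kwargs):
    # function that takes a datapackage and returns a dataframe;
    # **kwargs absorbs the extra keyword arguments odo passes to converters
    ...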

This will then allow you to make this conversion with odo.

To make datapackage seem more "native", you might want to also create dispatchers for append and discover.
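
For intuition about why one registered edge can make other targets reachable, here is a toy sketch of a type-conversion graph. It is not odo's implementation: `ToyConverter`, `Package`, and the breadth-first search are hypothetical names and choices made for illustration only, and odo's real graph is more elaborate.

```python
from collections import deque


class ToyConverter:
    """Toy conversion graph: nodes are types, edges are converter functions."""

    def __init__(self):
        self.edges = {}  # source type -> {target type: function}

    def register(self, target, source):
        """Decorator that adds an edge from source type to target type."""
        def decorator(func):
            self.edges.setdefault(source, {})[target] = func
            return func
        return decorator

    def path(self, source, target):
        """Breadth-first search for a chain of converters."""
        queue = deque([(source, [])])
        seen = {source}
        while queue:
            node, funcs = queue.popleft()
            if node is target:
                return funcs
            for nxt, func in self.edges.get(node, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, funcs + [func]))
        # No route found: report the unreachable source type.
        raise KeyError(source)

    def __call__(self, target, obj):
        for func in self.path(type(obj), target):
            obj = func(obj)
        return obj


class Package:
    """Hypothetical stand-in for a source object such as a data package."""

    def __init__(self, rows):
        self.rows = rows


toy_convert = ToyConverter()


@toy_convert.register(list, Package)
def package_to_list(pkg):
    return list(pkg.rows)


@toy_convert.register(tuple, list)
def list_to_tuple(seq):
    return tuple(seq)


# Package -> list -> tuple is found even though no direct edge exists.
result = toy_convert(tuple, Package([1, 2, 3]))
print(result)
```

In this toy, a type with no outgoing edge raises a `KeyError` carrying that type, which resembles the error in the original report, though the cause inside odo may differ.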

cpcloud commented 8 years ago

@femtotrader Glad you asked! There's some nice documentation on how to do this here. Let us know if you have any questions.

femtotrader commented 8 years ago

In [33]: datapkg.data
Out[33]: <itertools.chain at 0x104a2a898>

datapkg.data is an itertools.chain. In your opinion, what is the best way to integrate it with odo?

Do you really think that doing something like

pd.DataFrame(list(pkg.data))

is a good idea?

cpcloud commented 8 years ago

You could just return the data attribute, like this:

from collections.abc import Iterator

from datapackage import DataPackage
from odo import convert

@convert.register(Iterator, DataPackage)
def datapackage_to_iterator(datapkg, **kwargs):
    return datapkg.data

femtotrader commented 8 years ago

I tried this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import datapackage

# Note trailing slash is important for data.okfn.org
datapkg = datapackage.DataPackage('http://data.okfn.org/data/cpi/')

assert datapkg.title == "Annual Consumer Price Index (CPI)"
assert datapkg.description == "Annual Consumer Price Index (CPI) for most countries in the world. Reference year is 2005."
cpi_sum = sum([row['CPI'] for row in datapkg.data])
assert cpi_sum == 405442.60078415077

from collections.abc import Iterator

from odo import convert
@convert.register(Iterator, datapackage.DataPackage)
def datapackage_to_iterator(datapkg, **kwargs):
    return datapkg.data

from odo import odo

for row in odo(datapkg, Iterator):
    print(row)

import pandas as pd
df = odo(datapkg, pd.DataFrame)
print(df)

I thought I only needed to register an iterator as a source, but that doesn't seem to be enough to build a DataFrame (or a CSV file or a JSON file).

odo(datapkg, pd.DataFrame)

raises

TypeError: list indices must be integers, not str

Any ideas?

femtotrader commented 8 years ago

I wonder if it's really necessary to do this:

import pandas as pd
@convert.register(pd.DataFrame, datapackage.DataPackage)
def datapackage_to_dataframe(datapkg, **kwargs):
    return pd.DataFrame(list(datapkg.data))

df = odo(datapkg, pd.DataFrame)
print(df)

llllllllll commented 8 years ago

Can you provide the full stack trace of that TypeError?

femtotrader commented 8 years ago

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 40, in <module>
    df = odo(datapkg, pd.DataFrame)
  File "//anaconda/lib/python3.4/site-packages/odo/odo.py", line 90, in odo
    return into(target, source, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/multipledispatch/dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/into.py", line 25, in into_type
    return convert(a, b, dshape=dshape, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/core.py", line 30, in __call__
    return _transform(self.graph, *args, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/core.py", line 46, in _transform
    x = f(x, excluded_edges=excluded_edges, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/convert.py", line 215, in iterator_to_DataFrame_chunks
    df = convert(pd.DataFrame, first, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/core.py", line 30, in __call__
    return _transform(self.graph, *args, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/core.py", line 46, in _transform
    x = f(x, excluded_edges=excluded_edges, **kwargs)
  File "//anaconda/lib/python3.4/site-packages/odo/convert.py", line 166, in list_to_numpy
    seq = list(records_to_tuples(dshape, seq))
  File "//anaconda/lib/python3.4/site-packages/odo/utils.py", line 212, in records_to_tuples
    return get(ds.measure.names, data)
  File "//anaconda/lib/python3.4/site-packages/toolz/itertoolz.py", line 400, in get
    return operator.itemgetter(*ind)(seq)
TypeError: list indices must be integers, not str

femtotrader commented 8 years ago

I also tried

@convert.register(Iterator, datapackage.DataPackage)
def datapackage_to_iterator(datapkg, **kwargs):
    return datapkg.get_data(datapkg.resources[0])

datapkg.get_data(datapkg.resources[0]) returns a generator, but it raises the same exception.

However, I noticed that

In [29]: discover(datapkg.resources[0])
Out[29]:
dshape("""{
  datapackage_uri: string,
  format: string,
  is_local: bool,
  mediatype: string,
  name: string,
  schema: {
    fields: 4 * {
      description: ?string,
      format: ?string,
      name: string,
      type: string
      }
    },
  url: string
  }""")

and

In [30]: discover(datapkg.data)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-30-6707a7ba048d> in <module>()
----> 1 discover(datapkg.data)

//anaconda/lib/python3.4/site-packages/multipledispatch/dispatcher.py in __call__(self, *args, **kwargs)
    162             self._cache[types] = func
    163         try:
--> 164             return func(*args, **kwargs)
    165
    166         except MDNotImplementedError:

//anaconda/lib/python3.4/site-packages/datashape/discovery.py in discover(o, **kwargs)
     52         return from_numpy(o.shape, o.dtype)
     53     raise NotImplementedError("Don't know how to discover type %r" %
---> 54                               type(o).__name__)
     55
     56

NotImplementedError: Don't know how to discover type 'chain'
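
Since a one-shot iterator gives no type information up front, one workaround is to peek at the first row and put it back. The sketch below uses only the standard library; `peek` and `guess_fields` are hypothetical helpers, the rows are made-up sample data, and this is not how datashape's `discover` works.

```python
import itertools


def peek(iterator):
    """Return (first item, equivalent iterator) without losing the first item."""
    iterator = iter(iterator)
    first = next(iterator)
    return first, itertools.chain([first], iterator)


def guess_fields(row):
    """Map each key of a dict row to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in row.items()}


# Made-up rows, chained the same way a multi-resource source might be.
rows = itertools.chain(
    [{"name": "a", "value": 1.5}],
    [{"name": "b", "value": 2.5}],
)

first, rows = peek(rows)
fields = guess_fields(first)

# The first row is still present when the iterator is consumed afterwards.
remaining = list(rows)
print(fields, len(remaining))
```

Guessing from a single row is fragile (missing values, mixed types), so a declared schema is preferable when one is available.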

femtotrader commented 8 years ago

I'm also looking for some help converting a JSON Table Schema to a DataShape.

In [140]: datapkg.resources[0]['schema']['fields']
Out[140]:
[{'name': 'Country Name', 'type': 'string'},
 {'name': 'Country Code', 'type': 'string'},
 {'format': 'yyyy', 'name': 'Year', 'type': 'date'},
 {'description': 'CPI (where 2005=100)', 'name': 'CPI', 'type': 'number'}]

http://dataprotocols.org/data-packages/#schemas-property
http://dataprotocols.org/json-table-schema/
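
As a starting point, here is a minimal sketch that turns the fields above into a datashape-style string. The type mapping in `JTS_TO_DATASHAPE` is an assumption, as are the `var *` prefix and the quoting of field names that contain spaces; check the result against datashape before relying on it.

```python
# Assumed mapping from JSON Table Schema type names to datashape type names.
JTS_TO_DATASHAPE = {
    "string": "string",
    "number": "float64",
    "integer": "int64",
    "boolean": "bool",
    "date": "date",
    "datetime": "datetime",
}


def fields_to_dshape(fields):
    """Build a datashape-style record string from JSON Table Schema fields."""
    parts = []
    for field in fields:
        # An unmapped type raises KeyError rather than being guessed.
        type_name = JTS_TO_DATASHAPE[field["type"]]
        parts.append("{!r}: {}".format(field["name"], type_name))
    return "var * {" + ", ".join(parts) + "}"


fields = [
    {"name": "Country Name", "type": "string"},
    {"name": "Country Code", "type": "string"},
    {"format": "yyyy", "name": "Year", "type": "date"},
    {"description": "CPI (where 2005=100)", "name": "CPI", "type": "number"},
]

dshape_text = fields_to_dshape(fields)
print(dshape_text)
```

The sketch ignores the `format` and `description` keys, so a year-only date such as `yyyy` is treated as a plain date.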

cpcloud commented 8 years ago

@femtotrader Can you show list(datapkg.data)[0]?

femtotrader commented 8 years ago

In [42]: list(datapkg.data)[0]
Out[42]:
{'CPI': 89.1695876693231,
 'Country Code': 'AFG',
 'Country Name': 'Afghanistan',
 'Year': datetime.date(2004, 1, 1)}

In [43]: type(list(datapkg.data)[0])
Out[43]: dict
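
The last frame of the traceback above calls operator.itemgetter with the field names. The sketch below shows that this lookup works on a single dict row like the one printed here but fails on a list of rows. It is only a guess at the cause (a list reaching a lookup that expects one record), not a confirmed diagnosis of odo's behavior.

```python
import datetime
from operator import itemgetter

names = ("Country Name", "Country Code", "Year", "CPI")
row = {
    "CPI": 89.1695876693231,
    "Country Code": "AFG",
    "Country Name": "Afghanistan",
    "Year": datetime.date(2004, 1, 1),
}

# On a single dict row, string keys work and give a tuple in field order.
as_tuple = itemgetter(*names)(row)

# On a list of rows, the same call indexes the list with a string and fails.
try:
    itemgetter(*names)([row])
    error = None
except TypeError as exc:
    error = exc

print(as_tuple)
print(type(error).__name__)
```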

A new project to convert JSON Table Schema <--> Datashape is available here: https://github.com/okfn/jts-datashape