blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License

Discovery #65

Closed by mrocklin 10 years ago

mrocklin commented 10 years ago
In [1]: from datashape.discovery import discover

In [2]: data = [{'name': 'Alice', 'amount': '100'},
   ...:         {'name': 'Bob', 'amount': '200'},
   ...:         {'name': 'Charlie', 'amount': '300'}]

In [3]: discover(data)
Out[3]: dshape("3 * { amount : int64, name : string }")

We can handle missing data in a variety of cases as well. Here is the result of calling discover on the kiva_tiny dataset, which lives in blaze/samples/server/arrays/lenders.json:

In [4]: import json

In [5]: with open('lenders.json') as f:
   ...:     data = json.load(f)

In [6]: discover(data)
Out[6]: dshape("52 * { country_code : option[string], image : { id : int64, template_id : int64 }, invitee_count : int64, inviter_id : string, lender_id : string, loan_because : option[string], loan_count : int64, member_since : datetime, name : string, occupation : option[string], occupational_info : option[string], personal_url : option[string], uid : string, whereabouts : option[string] }")

mrocklin commented 10 years ago

Mostly I'm just curious about this approach. Once discover works on basic types, the blaze.data descriptors will rely on it after doing basic parsing on a subset of their data. Discover will also be extended to work on numpy arrays and pandas dataframes.
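
To make the idea concrete, here is a minimal sketch of discovery over a list of records. It is not the datashape implementation: `discover_scalar` and `discover_records` are hypothetical names, the type names are plain strings rather than datashape objects, and only a handful of scalar types are covered. It assumes a missing or `None` value should wrap the field type in `option[...]`, as in the kiva output above.

```python
def discover_scalar(value):
    """Map a Python scalar to a datashape-like type name (toy subset)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    return "string"


def discover_records(records):
    """Describe a list of dicts as 'N * { field : type, ... }'."""
    # Collect every field name seen in any record, in sorted order.
    fields = sorted({key for rec in records for key in rec})
    parts = []
    for name in fields:
        # A field absent from a record is treated like an explicit None.
        values = [rec.get(name) for rec in records]
        types = {discover_scalar(v) for v in values if v is not None}
        # Assumption: mixed or unknown types fall back to string.
        typ = types.pop() if len(types) == 1 else "string"
        if any(v is None for v in values):
            typ = "option[%s]" % typ
        parts.append("%s : %s" % (name, typ))
    return "%d * { %s }" % (len(records), ", ".join(parts))


records = [{"name": "Alice", "amount": 100},
           {"name": "Bob", "amount": None}]
print(discover_records(records))
# 2 * { amount : option[int64], name : string }
```

A descriptor could call something like this on the first few parsed rows rather than on the whole file, which is the "subset of their data" idea above.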

mrocklin commented 10 years ago

This now supports datetimes with dateutil.parser.parse and numpy with datashape.from_numpy.
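
A rough sketch of how string values can be classified by trying parsers in turn. This is an illustration only, not the code in this PR: `discover_string` is a hypothetical name, the parser order is an assumption, and the standard library's `datetime.fromisoformat` stands in for `dateutil.parser.parse` so the snippet has no third-party dependency (it accepts far fewer formats than dateutil does).

```python
from datetime import datetime


def discover_string(text):
    """Return a type name for a string by trying parsers, most specific first."""
    parsers = (
        ("int64", int),
        ("float64", float),
        # Stand-in for dateutil.parser.parse; ISO 8601 strings only.
        ("datetime", datetime.fromisoformat),
    )
    for type_name, parse in parsers:
        try:
            parse(text)
        except ValueError:
            continue  # this parser rejected the text, try the next one
        return type_name
    return "string"  # nothing parsed, keep it as a string


for text in ["100", "1.5", "2014-01-01T12:00:00", "Alice"]:
    print(text, "->", discover_string(text))
```

This is why `'100'` in the first example comes out as int64 even though the input values are strings.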

I've removed the WIP label. A lot of the work now needs to happen in blaze.data.

mrocklin commented 10 years ago

Looks like dateutil is non-standard. I'd still like to go ahead with it for now, adding this as a dependency. It seems to be in the main anaconda distribution though, so it's somewhat-standard.

mrocklin commented 10 years ago

Looks like there is an issue with dateutil. On conda it's named dateutil while on PyPI it's named python-dateutil.

mwiebe commented 10 years ago

pandas uses dateutil for this stuff, so using it will at least match the behavior people are used to from there.

mrocklin commented 10 years ago

OK, I've pushed up the change so that python-dateutil is in requirements.txt. This means that I'm preferring PyPI over conda. We can't support both automatically from a single requirements.txt. We'll need to either drop dateutil or specialize our build scripts.

mrocklin commented 10 years ago

@mwiebe should I use an assert_equals function? If so, where should I import it from? The tests have been nose/py.test agnostic so far. Should we select one?

mrocklin commented 10 years ago

Also, here is a page showing pytest magic. They must inspect assert statements and generate other code.

mrocklin commented 10 years ago

http://pytest.org/latest/assert.html

mwiebe commented 10 years ago

The assertion rewriting stuff looks great, another reason to switch to pytest. With that working, I would favour your preferred assert x == y syntax.
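
For reference, a test written in the plain `assert x == y` style might look like the sketch below. `column_names` is a hypothetical helper used only for illustration and is not part of datashape. Under pytest, a failing `assert` is rewritten so the report shows both sides of the comparison, which is what makes a separate `assert_equals` helper unnecessary.

```python
def column_names(records):
    """Hypothetical helper: sorted field names across a list of dicts."""
    return sorted({key for rec in records for key in rec})


def test_column_names():
    records = [{"name": "Alice", "amount": "100"},
               {"name": "Bob", "amount": "200"}]
    # Plain assert; pytest reports the left and right values on failure.
    assert column_names(records) == ["amount", "name"]
```

The same function also runs under nose, or when called directly, since it uses nothing beyond the built-in `assert` statement.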

mrocklin commented 10 years ago

This is again ready for review. @mwiebe you're probably the best candidate. I put comments in a couple places to direct your attention. You might also want to try it out on a dataset or look at the results from kiva now in the PR header.

mwiebe commented 10 years ago

+1 LGTM

mrocklin commented 10 years ago

@mwiebe I added your test and the bool/int/string relationship. This also exposed an error with my current system. Your test was good because the result was neither of the input types.
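
One way to picture a case where the result is neither input type is a promotion tree in which two types are unified to their nearest common ancestor. The sketch below is an assumption for illustration: the `PARENT` table, its ordering, and the names `ancestors` and `unify` are hypothetical and are not taken from the datashape source.

```python
# Hypothetical promotion tree: each type maps to the next wider type.
PARENT = {
    "bool": "int64",
    "int64": "float64",
    "float64": "string",
    "date": "datetime",
    "datetime": "string",
    "string": None,  # root: every value can be kept as a string
}


def ancestors(type_name):
    """Return type_name followed by each wider type up to the root."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = PARENT[type_name]
    return chain


def unify(a, b):
    """Return the narrowest type that both a and b can be promoted to."""
    wider_than_a = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in wider_than_a:
            return candidate
    raise ValueError("no common type for %s and %s" % (a, b))


print(unify("bool", "int64"))  # int64
print(unify("int64", "date"))  # string
```

In the second call neither `int64` nor `date` is the answer, so an implementation that only ever returns one of its two inputs would fail that kind of test.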

mrocklin commented 10 years ago

OK, merging this. I expect that we'll run into issues, but we won't know until we try.

mrocklin commented 10 years ago

Whoops. Forgot to add note about trouble cases.