blaze / blaze

NumPy and Pandas interface to Big Data
blaze.pydata.org
BSD 3-Clause "New" or "Revised" License
3.18k stars 389 forks source link

Slicing on Interactive Data with SQL does not work #1072

Closed Will-So closed 9 years ago

Will-So commented 9 years ago

Hi all,

Big fan of the project. I've started working on a 10 minutes to blaze tutorial (#423) and noticed that Slicing doesn't work for interactive objects. See here for an example.

I'd be happy to work on this but I still don't have a good grasp how exactly the Data function works. Any tips would be appreciated.

cpcloud commented 9 years ago

Hi Will, Thanks for the kind words. Slicing doesn't work for the SQL backend because the ordering of rows isn't predictable. Relational algebra deals with sets of tuples which don't have a deterministic ordering by default. Of course order by clauses can induce an order but there currently isn't a generic way to tell whether an expression is ordered. However maybe having generic metadata on the expression regarding its order isn't necessary. It might be possible to inspect the current sqlalchemy expression and see if there's an order by clause present and then slice and otherwise raise an error.

On Friday, May 1, 2015, Will So notifications@github.com wrote:

Hi all,

Big fan of the project. I've started working on a 10 minutes to blaze tutorial (#423 https://github.com/ContinuumIO/blaze/issues/423) and noticed that Slicing doesn't work for interactive objects. See here http://nbviewer.ipython.org/github/Will-So/Blaze-Workbook/blob/master/Slicing.ipynb for an example.

I'd be happy to work on this but I still don't have a good grasp how exactly the Data function works. Any tips would be appreciated.

— Reply to this email directly or view it on GitHub https://github.com/ContinuumIO/blaze/issues/1072.

Best, Phillip Cloud

Will-So commented 9 years ago

Checking to see if an SQLAlchemy object has an order_by statement would make sense.

In the case of there being no order_by clause, a good fallback might be to treat the slice as df[offset:limit] and print a warning that the order is not guaranteed to be the same each time. Indeed, this is how SQLAlchemy handles its slice function (without the warning). This would satisfy most of the use cases for slicing.

Without functionality like this it will be difficult to explore rows outside of the first 10 using Blaze. This is also a problem with functions like distinct(). You have to iterate through the results in order to view more than 10 distinct items. e.g., to get a list of 50 distinct items:

items = []
for item in df.H.distinct():
    items.append(item)
    if len(items) == 50:
        break
items

Rather than Pandas/Numpy which is just df.H.unique()[:50]. Operations like these are often one of the first things we do when examining a new dataset. Although this particular problem can be solved by allowing head() to return more than 10 items.

mrocklin commented 9 years ago

Head can return more than 10 items. What you're running into here is that the default repr function of interactive symbols slaps a head(10) on to the end of your expression to avoid dumping everything to the screen. If you actually want the full result you need to specify how you would like to represent it in memory. This might be done, for example, by

odo(expr.head(50), pd.DataFrame)

or

list(expr.head(50))

One could reinterpret your request as "If the user has already specified a head(...) length on their expression then please respect that when printing", which seems pretty reasonable.

On Fri, May 1, 2015 at 5:23 AM, Will So notifications@github.com wrote:

Checking to see if an SQLAlchemy object has an order_by statement would make sense.

In the case of there being no order_by clause, a good fallback might be to treat the slice as df[offset:limit] and print a warning that the order is not guaranteed to be the same each time. Indeed, this is how SQLAlchemy handles http://docs.sqlalchemy.org/en/improve_toc/orm/query.html its slice function (without the warning). This would satisfy most of the use cases for slicing.

Without functionality like this it will be difficult to explore rows outside of the first 10 using Blaze. This is also a problem with functions like distinct(). You have to iterate through the results in order to view more than 10 distinct items. e.g., to get a list of 50 distinct items:

items = []for item in df.H.distinct(): items.append(item) if len(items) == 50: break items

Rather than Pandas/Numpy which is just df.H.unique()[:50]. Operations like these are often one of the first things we do when examining a new dataset. Although this particular problem can be solved by allowing head() to return more than 10 items.

— Reply to this email directly or view it on GitHub https://github.com/ContinuumIO/blaze/issues/1072#issuecomment-98156115.

cpcloud commented 9 years ago

@Will-So The way to do this would be something like this:

@dispatch(Slice, Selectable)
def compute_up(expr, data, **kwargs):
    index = expr.index
    if isinstance(index, slice):
        start = index.start
        if start < 0:
            raise ValueError('start value of slice cannot be negative')
        stop = index.stop
        if stop < 0:
            raise ValueError('stop value of slice cannot be negative')
        if index.step is not None:
            raise ValueError('step parameter in slice objects not supported')
    elif isinstance(index, (np.integer, numbers.Integral)):
        if index < 0:
            raise ValueError('integer slice cannot be negative for the SQL backend')
        start = index
        stop = start + 1
    else:
        raise TypeError('type %r not supported for slicing' % type(index).__name__)
    return data.offset(start).limit(stop)
Will-So commented 9 years ago

Thanks for the detailed responses Phillip and Matthew. I now understand Blaze a better. I'll start working on a PR.