blaze / blaze

NumPy and Pandas interface to Big Data
blaze.pydata.org
BSD 3-Clause "New" or "Revised" License
3.18k stars 390 forks source link

ENH: Support joining on strings with different lengths #1189

Open TomAugspurger opened 9 years ago

TomAugspurger commented 9 years ago

This seems like it should be possible, at least for some backends.

from sqlalchemy import create_engine
import blaze as bz
import pandas as pd
import odo

engine = create_engine('sqlite:///db.db')
ds100 = bz.dshape("3 * {a: ?string[100]}")
ds200 = bz.dshape("3 * {a: ?string[200]}")

df = pd.DataFrame({'a': ['a', 'b', 'c']})

odo.odo(df, str(engine.url) + '::tmp100', dshape=ds100)
odo.odo(df, str(engine.url) + '::tmp200', dshape=ds200)
d = bz.Data(str(engine.url))

bz.join(d.tmp100, d.tmp200, on_left='a', on_right='a')

Raises with

TypeError                                 Traceback (most recent call last)
<ipython-input-7-74a2460a2e43> in <module>()
----> 1 bz.join(d.tmp100, d.tmp200, on_left='a', on_right='a')

/Users/tom.augspurger/Envs/py3/lib/python3.4/site-packages/blaze/expr/collections.py in join(lhs, rhs, on_left, on_right, how, suffixes)
    435                          (lhs, rhs))
    436     if types_of_fields(on_left, lhs) != types_of_fields(on_right, rhs):
--> 437         raise TypeError("Schema's of joining columns do not match")
    438     _on_left = tuple(on_left) if isinstance(on_left, list) else on_left
    439     _on_right = (tuple(on_right) if isinstance(on_right, list)

TypeError: Schema's of joining columns do not match
In [9]: odo.__version__
Out[9]: '0.3.3'

In [10]: bz.__version__
Out[10]: '0.8.2'
TomAugspurger commented 9 years ago

That was my first first thought. Although if its an inner join then the smaller should work.

I looked for a bit, but didn’t find much documentation.

On Jul 27, 2015, at 10:47 AM, Joe Jevnik notifications@github.com wrote:

In this case, would the schema (not sql schema) of the result be a string with a fixlen that is the max of the two string types being joined?

— Reply to this email directly or view it on GitHub https://github.com/ContinuumIO/blaze/issues/1189#issuecomment-125250555.

llllllllll commented 9 years ago

Accidently removed the comment. For other readers: I asked if the schema would be a string with a fixlen that is the max of the two.

Why would in inner join use the smaller?

Also, this would work with how we currently join numpy arrays and pandas dataframes so I think this is a good generalization as long as we understand what the resulting type is.

TomAugspurger commented 9 years ago

Inner join could use the smaller since it's going to contain the intersection of the join column. So by definition you can only join on strings that are the smaller of the two. I'm not sure if this is a good idea though. I think using the larger is fine though.

I need to test or find docs on how SQL handles joining here though

ghost commented 9 years ago

I decided to run a few experiments to see how sqlite handles some situations:

create table mytest (
id  varchar(20),
length int);

insert into mytest values ('1', 1);
insert into mytest values ('1 ', 2);
insert into mytest values ('1  ', 3);
insert into mytest values ('1   ', 4);
insert into mytest values ('1     ', 5);
insert into mytest values ('1      ', 6);
insert into mytest values (' 1      ', 7);   -- this one contains a space to the left of 1

select a.*, b.*
from mytest a,
          mytest b
where a.id=b.id;

the result contains only 7 rows, where the id field match exactly. I assume this means that the join will be independent of the max size but is only depend on the 'value' of the id field, not what it could contain. So I'm guessing that for strings, we really don't need to compare size, but only that the values are equal. I assume this would also be true for int32 vs int64, etc as well. If need be, I can run some tests for an id of varchar(20) vs a varchar(30) and see if there are different results, but my gut says there won't be compared to the results shown above.

ghost commented 9 years ago

I also joined the table mytest with another table with an id of varchar(25), and I still get the same result. Therefore I do not think the max size of a datatype makes a difference.

ghost commented 9 years ago

Tom, I'm testing a fix I'm developing.

I think there might be a small issue with the code you posted. Should the join be:

bz.join(df.tmp100, df.tmp200, on_left='a', on_right='a')

If so, I'm still having issues getting this to work. When I run it, I get the following error:

Traceback (most recent call last):
  File "test.py", line 15, in <module>
    bz.join(df.tmp100, df.tmp200, on_left='a', on_right='a')
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/generic.py", line 2150, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'tmp100'

Does anyone else get the same result?

Billy

TomAugspurger commented 9 years ago

@earney, sorry I left out the section where I create the actual blaze Data object. I've edited my original post to include

d = bz.Data(str(engine.url))
ghost commented 9 years ago

Thanks, that helps!

ghost commented 9 years ago

Here is what I have come up with so far.

Instead of

if types_of_fields(on_left, lhs) != types_of_fields(on_right, rhs):

we can do:

def join_compatible_types(left_types, right_types):
    for _i in range(len(left_types)):
        _left=str(left_types[_i].ty)
        _right=str(right_types[_i].ty)

        for _type in ('int', 'float', 'string'):
            if _left.startswith(_type) and not _right.startswith(_type):
               return False

    return True

and then replace the first set of code above with:

    if not join_compatible_types(types_of_fields(on_left, lhs), types_of_fields(on_right, rhs)):

string, float, and int are probably not the only types but are the most common, and I thought this would be a good start for a solution. What do you guys think? If this solution seems reasonable to you guys, I can add this to my forked repo, and can see if it passes all tests, and any extras we want to add. Thanks!

ghost commented 9 years ago

by the way, I ran this through the tests at blaze/blaze/tests/* and they seem to pass..