ContinuumIO / cyberpandas

IP Address dtype and block for pandas
BSD 3-Clause "New" or "Revised" License

Python 2 error - to_ipaddress #32

Open siyer32 opened 6 years ago

siyer32 commented 6 years ago

I'm not sure whether this is supported in Python 2.7.13, but I got the error below. It works fine in Python 3.6.5.

appdata['sa'] = cypd.to_ipaddress(appdata['sa'])

    163             '%r does not appear to be an IPv4 or IPv6 address. '
    164             'Did you pass in a bytes (str in Python 2) instead of'
--> 165             ' a unicode object?' % address)
    166
    167     raise ValueError('%r does not appear to be an IPv4 or IPv6 address' %

AddressValueError: '10.44.129.135' does not appear to be an IPv4 or IPv6 address. Did you pass in a bytes (str in Python 2) instead of a unicode object?

siyer32 commented 6 years ago

Here is the Python 3.6.5 output:

appdata['sa'] = cypd.to_ipaddress(appdata['sa'])
appdata.dtypes

sa                       ip
da                   object
sp                    int64
dp                    int64
ipkt                  int64
ibyt                  int64
Application Label    object

TomAugspurger commented 6 years ago

We either need to make a better error message here, or break with Python 2's ipaddress module.

In Python 2, to_ipaddress expects a unicode object when parsing a string IP address like '192.168.1.1'. See https://cyberpandas.readthedocs.io/en/latest/usage.html#parsing

In [13]: import pandas as pd

In [14]: from cyberpandas import to_ipaddress

In [15]: df = pd.DataFrame({"addr": ['192.168.1.1', '192.168.1.2']})

In [16]: to_ipaddress(df.addr)
---------------------------------------------------------------------------
AddressValueError                         Traceback (most recent call last)
<ipython-input-16-1f1c4ac488eb> in <module>()
----> 1 to_ipaddress(df.addr)

/Users/taugspurger/sandbox/cyberpandas/cyberpandas/parser.py in to_ipaddress(values)
     40         values = [values]
     41
---> 42     return IPArray(_to_ip_array(values))
     43
     44

/Users/taugspurger/sandbox/cyberpandas/cyberpandas/parser.py in _to_ip_array(values)
     59     elif not (isinstance(values, np.ndarray) and
     60               values.dtype == IPType._record_type):
---> 61         values = _to_int_pairs(values)
     62     return np.atleast_1d(np.asarray(values, dtype=IPType._record_type))
     63

/Users/taugspurger/sandbox/cyberpandas/cyberpandas/parser.py in _to_int_pairs(values)
     79         pass
     80     else:
---> 81         values = [ipaddress.ip_address(v)._ip for v in values]
     82         values = [unpack(pack(v)) for v in values]
     83     return values

/Users/taugspurger/miniconda3/envs/py27-ipaddr/lib/python2.7/site-packages/ipaddress.pyc in ip_address(address)
    163             '%r does not appear to be an IPv4 or IPv6 address. '
    164             'Did you pass in a bytes (str in Python 2) instead of'
--> 165             ' a unicode object?' % address)
    166
    167     raise ValueError('%r does not appear to be an IPv4 or IPv6 address' %

AddressValueError: '192.168.1.1' does not appear to be an IPv4 or IPv6 address. Did you pass in a bytes (str in Python 2) instead of a unicode object?

In [17]: to_ipaddress(df.addr.astype(unicode))
Out[17]: IPArray([u'192.168.1.1', u'192.168.1.2'])

So in literal code it should be u'192.168.1.1' instead of '192.168.1.1'. The current way is pretty unfriendly :/

seibert commented 6 years ago

By definition, IP address strings have to be ASCII (unlike hostnames), so I don't see a problem with to_ipaddress silently decoding Python 2 str to unicode assuming it is ASCII. Does that seem reasonable?

TomAugspurger commented 6 years ago

> Does that seem reasonable?

Yeah, I think so.
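
For illustration, the silent-decoding idea discussed above could look roughly like the sketch below. It uses only the standard-library ipaddress module; to_text and parse_ip are hypothetical helper names, not part of cyberpandas, and this is not necessarily how the library would implement it. It assumes the input is a textual address: packed raw bytes are not handled here and would need a separate code path.

```python
import ipaddress


def to_text(address):
    """Return `address` as text, decoding byte strings as ASCII.

    Hypothetical helper for illustration; not part of cyberpandas.
    """
    if isinstance(address, bytes):
        # On Python 2, `str` is `bytes`, so a plain '192.168.1.1' literal
        # lands here. Non-ASCII input raises UnicodeDecodeError rather
        # than being guessed at.
        return address.decode("ascii")
    return address


def parse_ip(address):
    """Parse a textual IP address given as either bytes or unicode."""
    return ipaddress.ip_address(to_text(address))


addresses = [b"192.168.1.1", u"192.168.1.2", b"::1"]
parsed = [parse_ip(a) for a in addresses]
print(parsed)
```

Decoding strictly as ASCII keeps the failure mode explicit: anything that cannot be an IP address string fails at the decode step instead of producing a confusing parse error later.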

siyer32 commented 6 years ago

Does this mean the IP addresses passed in have to be strings? Most data (like the one I tested) that are captured from the devices are not strings.

seibert commented 6 years ago

There are other input methods described in the docs: Python integers, or raw addresses in byte form (see IPArray.from_bytes).
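
As a rough illustration of how the same address looks in those non-string forms, here is a sketch using only the standard-library ipaddress module on Python 3. It does not call cyberpandas; the exact layout that to_ipaddress and IPArray.from_bytes accept should be checked against the cyberpandas docs. The variable names are made up for the example.

```python
import ipaddress

# The same IPv4 address in three forms.
as_text = u"192.168.1.1"
as_int = 3232235777                # 192*2**24 + 168*2**16 + 1*2**8 + 1
as_packed = b"\xc0\xa8\x01\x01"    # 4 raw bytes, network byte order

from_text = ipaddress.ip_address(as_text)
from_int = ipaddress.ip_address(as_int)
# On Python 3, a 4-byte (IPv4) or 16-byte (IPv6) bytes object is
# treated as a packed address rather than as text.
from_packed = ipaddress.ip_address(as_packed)

# All three parse to the same address object.
print(from_text == from_int == from_packed)

# Converting back out to the integer and packed forms.
print(int(from_text), from_text.packed)
```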

TomAugspurger commented 6 years ago

To clarify things, let's use Python 3's terminology. "string" is a unicode string, and "bytes" is a bytestring.

The valid options are

> Most data (like the one I tested) that are captured from the devices are not strings.

What does the raw data look like for you? If performance is a concern, the absolute fastest way is https://cyberpandas.readthedocs.io/en/latest/api.html#cyberpandas.IPArray.from_bytes