ContinuumIO / cyberpandas

IP Address dtype and block for pandas
BSD 3-Clause "New" or "Revised" License
104 stars 23 forks source link

Is there any interest in supporting IPNetwork as well? #40

Open ddutt opened 5 years ago

ddutt commented 5 years ago

Hi,

I'm interested in IPNetwork as well. Is there any work being done on this already? If not, is there any interest in supporting it?

Thanks,

Dinesh

TomAugspurger commented 5 years ago

Seems like a reasonable feature. I don’t have any plans to work on it.

On Jun 18, 2019, at 12:09, Dinesh Dutt notifications@github.com wrote:

Hi,

I'm interested in IPNetwork as well. Is there any work being done on this already? If not, is there any interest in supporting it?

Thanks,

Dinesh

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

ddutt commented 5 years ago

Cool. Any recommendations on providing the pull request provided I get to doing it?

Dinesh

ddutt commented 5 years ago

It seems best to create a new type called IPNetworkArray rather than fiddle with IPArray. Would you agree?

TomAugspurger commented 5 years ago

Yes, a separate ExtensionDtype and ExtensionArray makes the most sense.

Will the storage be essentially the same? An ndarray with a record dtype for the hi and lo bits?

I think you'll want to inherit from NumPyBackedExtensionArrayMixin: https://github.com/ContinuumIO/cyberpandas/blob/master/cyberpandas/base.py. I wrote that as a prototype for pandas, so it's a bit convoluted, but may be useful.

I suspect you find some duplicate functionality with IPArray. I wouldn't worry about deduplicating things too much at first.

ddutt commented 5 years ago

I was planning to store the thing as a CIDR string, /prefixlen. Would that work? I haven't written an extensions to Pandas dtypes before.

TomAugspurger commented 5 years ago

Hmm I'm not sure. For performance on operations, IPArray stores elements as a 128-bit integer, split across two columns of a NumPy array. This gives us NumPy's fast performance for operations like isna(), is_ipv4(), masking, etc. I'm not sure if you could do those kind of operations on a string, (but again, I don't know much about networking so I may be missing something).

On Tue, Jun 18, 2019 at 1:15 PM Dinesh Dutt notifications@github.com wrote:

I was planning to store the thing as a CIDR string, /prefixlen. Would that work? I haven't written an extensions to Pandas dtypes before.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ContinuumIO/cyberpandas/issues/40?email_source=notifications&email_token=AAKAOIQIOMUPPOOYGPMRLODP3EQ5LA5CNFSM4HZCEBW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODX7QE4A#issuecomment-503251568, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIRQ747C2OIUWHIXKU3P3EQ5LANCNFSM4HZCEBWQ .

seibert commented 5 years ago

It would be convenient for a number of operations if the representation of the network could be fixed width. Packed variable length strings are likely to make things complicated.

ddutt commented 5 years ago

So I have a prelim implementation working, but not sure of its efficiency. I stored the network as an IPv4Network or IPv6Network object itself instead of trying to do a string or any other fancy implementation. For things like is_ipv4, the IPArray approach is much faster and for things like is_multicast, is_link_local, the approach I've taken seems faster from a simple %timeit on a few things perspective.

I'd like some eyes on what I've done to see if there are changes to make the thing better. What do you suggest? I still need a lot of cleaning up if all looks good.

Dinesh

ddutt commented 5 years ago

I've got df.a.astype('ipnetwork'), df.query('a.ipnet.is_link_local') etc. working

Dinesh

TomAugspurger commented 5 years ago

On Wed, Jun 19, 2019 at 2:39 PM Dinesh Dutt notifications@github.com wrote:

So I have a prelim implementation working, but not sure of its efficiency. I stored the network as an IPv4Network or IPv6Network object itself instead of trying to do a string or any other fancy implementation. For things like is_ipv4, the IPArray approach is much faster and for things like is_multicast, is_link_local, the approach I've taken seems faster from a simple %timeit on a few things perspective.

On IPArray many methods, including is_multicast and is_link_local, convert from the fast ndarray of integers to the slow ndarray of IPv4Address objects. So I would expect those to be roughly the same (a bit slower, since IPArray.is_link_local needs to do the conversion first). Some of these could likely be implemented without converting to objects first.

That said, as a prototype this is probably fine for now. You may just want to make the .data attribute private in case the internal representation changes in the future.

I'd like some eyes on what I've done to see if there are changes to make

the thing better. What do you suggest? I still need a lot of cleaning up if all looks good.

A PR is probably best, though I likely won't have time to review until after the next distributed and pandas releases.

Dinesh

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ContinuumIO/cyberpandas/issues/40?email_source=notifications&email_token=AAKAOISPIP6IXGFB2ER3QY3P3KDPNA5CNFSM4HZCEBW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYDCAZI#issuecomment-503717989, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIUVYWTPCMHHYDFKC6TP3KDPNANCNFSM4HZCEBWQ .

ddutt commented 5 years ago

Thanks Tom. Since you won't have time for a bit, let me clean it up and ensure the data is private for now. I'll look at what I can do (maybe add a third byte which is the prefixlen to your ip address model) to make this more efficient. Is Cython a possibility here? I'll send a PR after I'm comfortable its cleaned up.

Does that make sense?

Dinesh

TomAugspurger commented 5 years ago

Sounds good.

Not sure about Cython, unless it's really necessary, since that complicates the build / packaging so much.

On Wed, Jun 19, 2019 at 3:00 PM Dinesh Dutt notifications@github.com wrote:

Thanks Tom. Since you won't have time for a bit, let me clean it up and ensure the data is private for now. I'll look at what I can do (maybe add a third byte which is the prefixlen to your ip address model) to make this more efficient. Is Cython a possibility here? I'll send a PR after I'm comfortable its cleaned up.

Does that make sense?

Dinesh

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ContinuumIO/cyberpandas/issues/40?email_source=notifications&email_token=AAKAOIWIVKX4IQUQL7NFYDTP3KF6DA5CNFSM4HZCEBW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYDDWJI#issuecomment-503724837, or mute the thread https://github.com/notifications/unsubscribe-auth/AAKAOIVY2G22YMA5YSVGHL3P3KF6DANCNFSM4HZCEBWQ .

ddutt commented 5 years ago

I've created a pull request. I've tried to follow the model you've followed for IPArray in terms of where the different functions live. I've also added support for max/min besides support for supernet_of and subnet_of.

giganteous commented 3 years ago

Hello, is there any progress on this issue?

mzpqnxow commented 3 years ago

@ddutt , @TomAugspurger did this go anywhere? I have a use-case for this and was curious if it's still "a thing"

TomAugspurger commented 3 years ago

Not really, I'm don't have time to work on cyberpandas these days.

ddutt commented 3 years ago

I don't have time to work on cyberpandas either. I found other ways to work around it. But if you have some code, I'm happy to test it.