Avoid copying data back and forth between the Python runtime and the C++ library

unzvfu commented 6 years ago

~According to the benchmark, copying to and from the C++ library currently takes 99% of the time.~ Edit: Not true. Copying does take a lot of time in some places (see #79), but the copying mentioned below is not "99%" bad, more like "5%" bad. Really the C++ library should just be passed an address to the memory it needs to look at directly, making the copy redundant. This may take advantage of a cleaner interface to the arrays provided by a resolution to issue #64.

This issue partly supersedes issue #29 where Brian says:

We also are dealing with "nice" python bitarrays which require some manipulation (1) before passing into native code. We might want to consider adding an accelerated interface that takes our custom bit packed data as plain python bytes.

1: [ffi.new("char[128]", bytes(f[0].tobytes())) for f in filters1]

I've started experimenting in two branches:

Branch feature-chunked-speedup has a C implementation of many x many comparisons.

Branch feature-direct-cffi builds on top of that and looks at accessing bitarray data from C without a memory copy. It only does a bitarray popcount for now:
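# Assume ba is a bitarray, and ffi/lib come from the compiled cffi module
addr = ba.buffer_info()[0]        # address of the bitarray's underlying buffer
pntr = ffi.cast("char *", addr)   # wrap the address as a C pointer; no copy is made
lib.popcount(pntr)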
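
For anyone who wants to try the idea without building the branch, here is a rough standard-library-only sketch of the same trick. It is not the code in feature-direct-cffi: it assumes a plain `array.array("B")` as a stand-in for a bit-packed filter, uses `ctypes` instead of cffi, and the `popcount` helper is a hypothetical pure-Python placeholder for the C function.

```python
import array
import ctypes

# Hypothetical stand-in for one 128-byte bit-packed filter.
filt = array.array("B", bytes(128))

# buffer_info() returns (address, number of items); items are 1 byte each here.
addr, length = filt.buffer_info()

# Build a ctypes view directly on that address. Nothing is copied.
view = (ctypes.c_ubyte * length).from_address(addr)

# A write through the Python array is visible through the view,
# which shows that both names refer to the same memory.
filt[0] = 0b10110000
shared = view[0] == filt[0]


def popcount(buf):
    """Pure-Python placeholder for the native popcount."""
    return sum(bin(byte).count("1") for byte in buf)


count = popcount(view)
```

One caveat with passing raw addresses around: the address is only valid while the array is alive and has not been resized, so the Python object has to outlive the native call.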

The comments in issue #18 might still be relevant.

Aha! Link: https://csiro.aha.io/features/ANONLINK-68

hardbyte commented 6 years ago

We still need to measure just how much of the time is taken by memory copying and type conversions.
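
A minimal sketch of how that measurement could start, assuming stand-in data rather than the real benchmark: it uses `array.array` and `ctypes` from the standard library in place of bitarray and cffi, and the filter count and size (1000 filters of 128 bytes) are made-up values. It only times the Python-side marshalling, not the C++ call itself.

```python
import array
import ctypes
import timeit

N_FILTERS = 1000   # assumed number of filters, for illustration only
FILTER_BYTES = 128

filters = [array.array("B", bytes(FILTER_BYTES)) for _ in range(N_FILTERS)]


def with_copy():
    # Mirrors the current approach: serialise to bytes, then copy into a C buffer.
    return [ctypes.create_string_buffer(f.tobytes(), FILTER_BYTES) for f in filters]


def without_copy():
    # Shares the array's memory through the buffer protocol. No copy is made.
    return [(ctypes.c_char * FILTER_BYTES).from_buffer(f) for f in filters]


copy_s = timeit.timeit(with_copy, number=20)
view_s = timeit.timeit(without_copy, number=20)
print(f"copy: {copy_s:.4f}s  zero-copy view: {view_s:.4f}s")
```

Comparing these two numbers against the total runtime of a comparison would give a first estimate of how much the conversions actually cost.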

hardbyte commented 6 years ago

There was some further discussion on switching to array - https://github.com/n1analytics/anonlink/pull/121#discussion_r200878288

data61 / anonlink

Avoid copying data back and forth between the Python runtime and the C++ library #66