diana-hep / oamap

Perform high-speed calculations on columnar data without creating intermediate objects.
BSD 3-Clause "New" or "Revised" License

nested types in pandas #9

Open martindurant opened 5 years ago

martindurant commented 5 years ago

How difficult would it be to have nested oamap structures as a column in pandas using the extension types interface? I could see that being a win both ways: normal pandas tabular analysis, plus descending into the nested structures with fast numba-jit code.

As a side issue, has there been any string functionality? I.e., if a leaf node type is string, is there anything that you can do with that within a numba function?

jpivarski commented 5 years ago

Thanks for pointing this out. I didn't know that Pandas has extension types, and it would definitely be a good idea to make awkward-array aware of them. (They should be both Numba extensions and Pandas extensions.) The development will be in the awkward-array repo, though.

As for strings, I've been representing them as jagged arrays of characters (in awkward array; OAMap's terminology is a List(Primitive(uint8))).
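To make the "jagged array of characters" layout concrete, here is a minimal sketch in plain NumPy. It does not use the awkward-array or OAMap APIs; the helper names `encode_strings` and `decode_string` are hypothetical, and the offsets-plus-content layout is assumed to be the usual one for a list of uint8.

```python
import numpy as np

def encode_strings(strings):
    """Pack Python strings into a flat uint8 content array plus offsets (hypothetical helper)."""
    encoded = [s.encode("utf-8") for s in strings]
    # offsets[i]:offsets[i + 1] delimits the bytes of string i in `content`.
    offsets = np.zeros(len(encoded) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(b) for b in encoded])
    content = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    return offsets, content

def decode_string(offsets, content, i):
    """Recover string i from the jagged representation (hypothetical helper)."""
    return content[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")

offsets, content = encode_strings(["muon", "", "jet"])
```

Every string in the column shares the two flat arrays, so no per-string Python object exists until one is explicitly decoded.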

martindurant commented 5 years ago

Yes, I didn't know where things stood with awkward versus oamap. Actually operating on the strings may be problematic, however, given numba limitations. The Rust string (UTF-8) API is surprisingly complete and may be the best thing to leverage for functions like startswith, replace, or find, but then you need to care about creating new arrays within the jit-function, rather than just applying logic and aggregating.
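The difference between the two kinds of operation can be sketched in plain NumPy, assuming strings are stored as a flat uint8 `content` array with an `offsets` array. The functions `startswith` and `strip_prefix` below are hypothetical illustrations, not part of oamap, awkward-array, or numba. They use explicit loops of the kind numba's nopython mode targets, but they have not been compiled with numba here.

```python
import numpy as np

def startswith(offsets, content, prefix):
    """Boolean mask: does each string start with `prefix` (a uint8 array)?"""
    n = len(offsets) - 1
    out = np.zeros(n, dtype=np.bool_)
    for i in range(n):
        start, stop = offsets[i], offsets[i + 1]
        if stop - start < len(prefix):
            continue  # string is shorter than the prefix
        match = True
        for j in range(len(prefix)):
            if content[start + j] != prefix[j]:
                match = False
                break
        out[i] = match
    return out

def strip_prefix(offsets, content, prefix):
    """Remove `prefix` where present; this must build new offsets and content."""
    mask = startswith(offsets, content, prefix)
    n = len(offsets) - 1
    # Pass 1: compute output lengths so the result can be allocated up front.
    new_offsets = np.zeros(n + 1, dtype=np.int64)
    for i in range(n):
        length = offsets[i + 1] - offsets[i]
        if mask[i]:
            length -= len(prefix)
        new_offsets[i + 1] = new_offsets[i] + length
    # Pass 2: copy the surviving bytes into the new content array.
    new_content = np.empty(new_offsets[n], dtype=np.uint8)
    for i in range(n):
        start = offsets[i] + (len(prefix) if mask[i] else 0)
        new_content[new_offsets[i]:new_offsets[i + 1]] = content[start:offsets[i + 1]]
    return new_offsets, new_content

# Three strings, "jet1", "muon", "jet", stored as one jagged array.
content = np.frombuffer(b"jet1muonjet", dtype=np.uint8)
offsets = np.array([0, 4, 8, 11], dtype=np.int64)
prefix = np.frombuffer(b"jet", dtype=np.uint8)

mask = startswith(offsets, content, prefix)
new_offsets, new_content = strip_prefix(offsets, content, prefix)
```

`startswith` only reads the input and fills a fixed-size mask, which is the "apply logic and aggregate" case. `strip_prefix` changes string lengths, so it needs a counting pass before it can allocate its output, and that is the extra bookkeeping a replace-style function would require inside a jit-function.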