nwagner84 closed this issue 1 year ago.
Similar to #367, this might be better done in an independent tool, if needed at all.
Personally, my data analysis workflows never end in pica-rs. There is always a step where I need to load the data into another program to do some more tidying, reshaping etc. I wouldn't turn pica-rs into full-blown data analysis software. However, I think there is a good argument to be made that pica-rs should offer interfaces that make downstream analysis more convenient.

CSV is quite universal as an interface, but it has severe limitations when it comes to read-in time and disk space usage. In 2016, an initiative put forward by RStudio and the Apache Arrow project introduced the Feather file format, developed as an on-disk format to increase the interoperability between R and Python. By now, I think it has matured and would be, similar to Parquet, a good interface format to increase the usability of pica-rs in downstream analysis. Generally, I believe Apache Arrow is a good direction to go with regard to columnar interface formats.
Oh, I did not know Apache Arrow, this looks interesting. I can imagine loading a dump into memory and querying it with `filter` and `select` as supported on the command line from an input stream. This would actually be a PICA+ database such as CBS, but with other capabilities, and #363 could be an interface to it. I'm curious about the actual speed gain of using an in-memory database.
My planned use case is to load a 150 GB dump of PICA+ data into a key-value store and quickly access and modify records identified by their PPN. As this is too much for an in-memory database, key-value storage engines such as RocksDB look suitable. One level up, UKV looks good and supports Apache Arrow in some way, but there is no Rust binding yet. As long as every read/write uses a PPN, it may also help to reduce each PPN to an internal integer by removing the check digit. Anyway, some support for a key-value store may be helpful.
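A minimal sketch of that idea, using sqlite3 as a stand-in for an embedded engine like RocksDB. It assumes the last PPN character is the check digit ('X' standing for 10), so stripping it leaves a plain integer key; the class and helper names are made up for illustration:

```python
import sqlite3

def ppn_to_int(ppn: str) -> int:
    """Drop the trailing check digit and parse the remainder as an integer key."""
    return int(ppn[:-1])

class RecordStore:
    """Toy PPN-keyed record store; sqlite3 stands in for RocksDB here."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records (ppn INTEGER PRIMARY KEY, data BLOB)"
        )

    def put(self, ppn: str, record: bytes):
        self.db.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?)",
            (ppn_to_int(ppn), record),
        )

    def get(self, ppn: str):
        row = self.db.execute(
            "SELECT data FROM records WHERE ppn = ?", (ppn_to_int(ppn),)
        ).fetchone()
        return row[0] if row else None

store = RecordStore()
store.put("040000029", b"raw PICA+ record bytes")
assert store.get("040000029") == b"raw PICA+ record bytes"
```

Keying by an 8-byte integer instead of a variable-length string keeps the index compact, which matters at 150 GB scale regardless of which engine ends up underneath.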