deutsche-nationalbibliothek / pica-rs

Tools to work with bibliographic records encoded in PICA+.
https://deutsche-nationalbibliothek.github.io/pica-rs/
European Union Public License 1.2

Support Apache Arrow format #383

Closed: nwagner84 closed this issue 1 year ago

nwagner84 commented 2 years ago
nichtich commented 2 years ago

Similar to #367 this might better be done in an independent tool, if needed at all.

mfakaehler commented 2 years ago

Personally, my data analysis workflow with pica-rs never ends in pica-rs. There is always a step where I need to load the data into another program for more tidying, reshaping, etc. I wouldn't turn pica-rs into full data analysis software. However, I think there is a good argument that pica-rs should offer interfaces that make downstream analysis more convenient. CSV is quite universal as an interface, but it has severe limitations when it comes to read-in time and disk space usage. In 2016, an initiative by RStudio and Apache Arrow introduced the Feather file format, an on-disk format developed to increase interoperability between R and Python. I think it has matured by now and would be, like Parquet, a good interface format to make pica-rs more usable in downstream analysis. Generally, I believe Apache Arrow is a good direction to go with regard to columnar interface formats.

nichtich commented 2 years ago

Oh, I did not know Apache Arrow, this looks interesting. I can imagine loading a dump into memory and querying it with filter and select, as the command line already supports for an input stream. This would effectively be a PICA+ database like CBS, but with other capabilities, and #363 could be an interface to it. I'm curious about the actual speed gain from using an in-memory database.

nichtich commented 1 year ago

My planned use case is to load a 150 GB dump of PICA+ data into a key-value store and quickly access and modify records identified by their PPN. As this is too much for an in-memory database, key-value storage engines such as RocksDB look suitable. One level up, UKV looks good and supports Apache Arrow in some way, but there is no Rust binding yet. As long as every read/write uses a PPN, it may also help to reduce the PPN to an internal integer by removing the checksum. Anyway, some support for a key-value store may be helpful.
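
The idea of reducing a PPN to an internal integer could look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes a PPN is a run of decimal digits followed by exactly one check character (a digit or `X`), it does not verify the checksum, and the function names `ppn_to_int` and `int_to_key`, the sample PPNs, and the record payloads are all hypothetical. A plain `dict` stands in for a real storage engine.

```python
def ppn_to_int(ppn: str) -> int:
    """Drop the trailing check character and parse the rest as an integer.

    Assumption: digits followed by one check character (digit or X).
    The checksum itself is NOT validated here.
    """
    if len(ppn) < 2:
        raise ValueError(f"PPN too short: {ppn!r}")
    body, check = ppn[:-1], ppn[-1]
    if not (body.isascii() and body.isdigit()):
        raise ValueError(f"unexpected PPN body: {ppn!r}")
    if not (check.isascii() and (check.isdigit() or check in "Xx")):
        raise ValueError(f"unexpected check character: {ppn!r}")
    return int(body)


def int_to_key(n: int) -> bytes:
    """Encode the integer as a fixed-width, big-endian 8-byte key.

    For non-negative integers, fixed-width big-endian bytes compare
    bytewise in the same order as the integers themselves.
    """
    return n.to_bytes(8, "big")


# A dict as a stand-in for a key-value store (hypothetical payloads).
store = {}
store[int_to_key(ppn_to_int("12345678X"))] = b"record A"
store[int_to_key(ppn_to_int("040000001"))] = b"record B"

# Look up a record by its PPN.
record = store[int_to_key(ppn_to_int("12345678X"))]
```

One caveat with this approach: leading zeros do not survive the integer conversion (`"040000001"` becomes `4000000`), so restoring the original PPN string would need the original length in addition to recomputing the check character.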