amashadihossein / dpbuild

https://amashadihossein.github.io/dpbuild/
GNU General Public License v3.0

Add support for deploying objects in arrow/parquet format #100

Open bkutlu opened 2 weeks ago

bkutlu commented 2 weeks ago

This is similar to issue #99.

amashadihossein commented 2 weeks ago

Thanks for opening this issue! Using parquet makes a lot of sense. A while back, before we decided to migrate to the new pins, I started a package, tbit, that was intended to be parquet-based and a better fit for daapr workflows than pins. The basic idea was to move away from RDS, make things parquet-centric, and in the process make data products language-agnostic for users. Roughly speaking, we could think of a data product as a list of tables: the tables would be parquet files, and the list would be a lightweight JSON or YAML file pointing to those parquet files. This could bring real benefits for the size and read speed of dps, and it could also open the door to extending the consumer base for daapr beyond R. As this was a concrete chunk of development, a stand-alone package, tbit, seemed like a better way to go than adding the code to the daapr packages. This package could potentially even handle some of the logic now handled by the other daapr packages, making them less bulky.
I am glad we went ahead with the updated pins, as that enabled a quicker turnaround in re-establishing the base functionality. But if we are thinking of improvements beyond base functionality, it might be worthwhile to reconsider this and see if we can envision a path there.
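
To make the "list of tables" idea concrete, here is a minimal sketch of what the lightweight JSON pointer file might look like. It is written in Python only to illustrate the language-agnostic angle. The manifest layout, the file name `manifest.json`, and the helper names (`build_manifest`, `write_manifest`, `table_paths`) are all assumptions for illustration, not an existing tbit or daapr format. Writing the parquet files themselves is left out.

```python
import json
from pathlib import Path


def build_manifest(name, tables):
    """Describe a data product as a named list of parquet-backed tables.

    `tables` maps a table name to the relative path of its parquet file.
    This layout is hypothetical, not a tbit or daapr format.
    """
    return {
        "name": name,
        "tables": [
            {"name": table, "path": str(path), "format": "parquet"}
            for table, path in sorted(tables.items())
        ],
    }


def write_manifest(manifest, directory):
    """Write the manifest as JSON into the data product directory."""
    out = Path(directory) / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out


def table_paths(manifest_file):
    """Map each table name to its parquet path, resolved against the manifest's folder."""
    manifest_file = Path(manifest_file)
    manifest = json.loads(manifest_file.read_text())
    return {
        entry["name"]: manifest_file.parent / entry["path"]
        for entry in manifest["tables"]
    }
```

A consumer would only need a JSON parser and a parquet reader, for example `arrow::read_parquet()` in R or `pyarrow.parquet.read_table()` in Python, to load the tables returned by `table_paths()`.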