Open LunarEngineer opened 7 months ago
Swapping to Ibis would allow Pandas to be removed as a dependency. Using Ibis, PyArrow, and DuckDB as the standard interface, while still allowing the backend to be configured, would make this much more widely useful.
As an example, loading fileids might look something like this:
import ibis
import pyarrow.dataset as ds
from thethingstore import ThingStore
from typing import Iterable, Literal, Union


def load_fileids(
    data_layer: ThingStore,
    fileid: Union[str, Iterable[str]],
    representation: Literal["dataset", "inmemory"] = "dataset",
) -> Union[ds.Dataset, ibis.Table]:
    """Load fileid(s).

    TODO: Move to ThingStore.

    Parameters
    ----------
    data_layer: ThingStore
        This is a ThingStore compliant data layer.
    fileid: Union[str, Iterable[str]]
        Fileid(s) to load.
    representation: {"dataset", "inmemory"} = "dataset"
        Whether to return a PyArrow dataset or an in-memory Ibis table.

    Returns
    -------
    dataset: Union[ds.Dataset, ibis.Table]
    """
    if isinstance(fileid, str):
        # A single fileid is loaded on its own.
        thing_dataset = ds.dataset(data_layer.get_dataset(fileid))
    else:
        # Several fileids are combined into one dataset.
        thing_dataset = ds.dataset([data_layer.get_dataset(_) for _ in fileid])
    if representation == "dataset":
        return thing_dataset
    elif representation == "inmemory":
        # Materialize the dataset and wrap it as an Ibis in-memory table.
        return ibis.memtable(thing_dataset.to_table())
    # Fail loudly instead of silently returning None.
    raise ValueError(f"Unknown representation: {representation!r}")
Note: The verbiage expressed here is that of the author and is not representative of State Farm.
Is your feature request related to a problem? Please describe.
Ibis is a more appropriate in-memory representation. It is more performant and friendlier than Pandas for DataFrame operations.
Describe the solution you'd like
Replace Pandas with Ibis.
Describe alternatives you've considered
There are many DataFrame APIs, but the one Ibis implements is friendly and efficient, and it allows backends to be swapped at will.