abdenlab / oxbow

Read specialized NGS formats as data frames in R, Python, and more.
https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos
Apache License 2.0
59 stars 8 forks

Feature request: Return iterator of RecordBatches (in Python) #58

Open shenker opened 9 months ago

shenker commented 9 months ago

Oxbow's Python functions (read_bam, etc.) currently return a bytes object. It would be great if they instead returned an iterator of pa.RecordBatch objects. The goal is twofold: to allow reading files in chunks (instead of loading the whole file into memory), and to return PyArrow objects (which can be turned into pa.Table objects, polars/pandas dataframes, etc.) rather than bare bytes. The desired chunk size (in number of rows? in bytes?) would ideally be exposed as a kwarg.