Update: adding a --parquet option to the run_pipeline and make_pipeline kwargs, since parquet seems to be more efficient than pickle and is exportable into a variety of high-performance ML apps.
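A rough usage sketch of the proposed kwarg (the `parquet=True` parameter is what this update describes, not yet a released API, so the exact name and signature may differ):

```python
from methylprep import run_pipeline

# Hypothetical: parquet=True is the option being added in this update;
# betas=True already exists and returns a DataFrame of beta values.
betas = run_pipeline('data_dir/', betas=True, parquet=True)
```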
Looking at the export option for an Arrow table:

```python
import pandas as pd
import pyarrow as pa

# Load the parquet output back into a DataFrame
df = pd.read_parquet('your_file.parquet')

# Derive the schema and build an Arrow table from the DataFrame
schema = pa.Schema.from_pandas(df, preserve_index=False)
table = pa.Table.from_pandas(df, preserve_index=False)

sink = "myfile.arrow"
# Note: new_file creates a RecordBatchFileWriter
writer = pa.ipc.new_file(sink, schema)
writer.write(table)
writer.close()
```
For that, I'll have to add an optional dependency on Apache Arrow that raises an error if it isn't installed, but doesn't force users to install the package just to run methylsuite.
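A minimal sketch of how that optional-dependency pattern could look (the `export_arrow` helper here is hypothetical, not the actual methylsuite implementation):

```python
# Optional import: pyarrow is only required if the user asks for Arrow export.
try:
    import pyarrow as pa
except ImportError:
    pa = None

def export_arrow(df, sink):
    """Hypothetical helper: write a DataFrame to an Arrow IPC file, if pyarrow is available."""
    if pa is None:
        raise ImportError(
            "Exporting to Arrow requires pyarrow; install it with `pip install pyarrow`."
        )
    table = pa.Table.from_pandas(df, preserve_index=False)
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write(table)
```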
Big +1 for using parquet. Parquet is quite portable (it can be read from other languages, including R), but also fast and standardized. By contrast, you shouldn't expect that after updating pandas or Python you can still read a pickle created with a previous version.
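For illustration, a minimal round trip showing that portability (file and column names here are made up):

```python
import pandas as pd

# The same parquet file can be read back by pandas, pyarrow,
# or R's arrow package (arrow::read_parquet in R).
df = pd.DataFrame({"Sample_ID": ["S1", "S2"], "beta": [0.42, 0.87]})
df.to_parquet("betas.parquet", index=False)  # needs pyarrow or fastparquet

restored = pd.read_parquet("betas.parquet")
```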
Adam: It would make sense to use Apache Arrow instead of pickle as a file format, as NVIDIA has libraries that can load it directly onto the GPU, and it's a language-independent format. It would be important to make sure that the metadata file always has a Sample_ID column.
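A rough sketch of what that GPU path might look like with RAPIDS cuDF (an assumption on my part about which NVIDIA library is meant; requires a CUDA-capable GPU and the cudf package; file name carried over from the snippet above):

```python
import pyarrow as pa
import cudf  # RAPIDS cuDF; requires a CUDA-capable GPU

# Read the Arrow IPC file written earlier and hand it to the GPU.
table = pa.ipc.open_file("myfile.arrow").read_all()
gdf = cudf.DataFrame.from_arrow(table)

# Guard the metadata contract mentioned above.
assert "Sample_ID" in gdf.columns, "metadata must include a Sample_ID column"
```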
Helpful to know. This format would be useless to me, and probably to a lot of vanilla Python users who've never seen it, but we can add the option to support deep learning. I've never seen an Apache Arrow file.
Going from CSVs to pickles was a design choice that broke compatibility with R users, but made everything 100X faster when dealing with large data sets that can't fit into memory. Loading a stack of CSVs takes minutes, compared to a few seconds with a Python 3 pickle.