Update: adding a --parquet option to the run_pipeline and make_pipeline kwargs, since parquet seems to be more efficient than pickle and is exportable into a variety of high-performance ML apps.
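A rough usage sketch of the proposed kwarg (the `parquet=True` parameter is what this update describes, not yet a released API, so the exact name and signature may differ):

```python
from methylprep import run_pipeline

# Hypothetical: parquet=True is the option being added in this update;
# betas=True already exists and returns a DataFrame of beta values.
betas = run_pipeline('data_dir/', betas=True, parquet=True)
```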
Looking at the export option for an Arrow table:

```python
import pandas as pd
import pyarrow as pa

# Load the parquet output back into a DataFrame
df = pd.read_parquet('your_file.parquet')

# Derive the schema and build an Arrow table from the DataFrame
schema = pa.Schema.from_pandas(df, preserve_index=False)
table = pa.Table.from_pandas(df, preserve_index=False)

sink = "myfile.arrow"
# Note: new_file creates a RecordBatchFileWriter
writer = pa.ipc.new_file(sink, schema)
writer.write(table)
writer.close()
```
For that, I'll have to add an optional dependency on Apache Arrow that raises an error if it isn't installed, but doesn't force users to install the package just to run methylsuite.
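A minimal sketch of how that optional-dependency pattern could look (the `export_arrow` helper here is hypothetical, not the actual methylsuite implementation):

```python
# Optional import: pyarrow is only required if the user asks for Arrow export.
try:
    import pyarrow as pa
except ImportError:
    pa = None

def export_arrow(df, sink):
    """Hypothetical helper: write a DataFrame to an Arrow IPC file, if pyarrow is available."""
    if pa is None:
        raise ImportError(
            "Exporting to Arrow requires pyarrow; install it with `pip install pyarrow`."
        )
    table = pa.Table.from_pandas(df, preserve_index=False)
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write(table)
```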
Big +1 for using parquet. Parquet is quite portable (it can be read from other languages, including R), but also fast and standardized. By contrast, you shouldn't expect that after updating pandas or Python you can still read a pickle created with a previous version.
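For illustration, a minimal round trip showing that portability (file and column names here are made up):

```python
import pandas as pd

# The same parquet file can be read back by pandas, pyarrow,
# or R's arrow package (arrow::read_parquet in R).
df = pd.DataFrame({"Sample_ID": ["S1", "S2"], "beta": [0.42, 0.87]})
df.to_parquet("betas.parquet", index=False)  # needs pyarrow or fastparquet

restored = pd.read_parquet("betas.parquet")
```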
Adam: It would make sense to use Apache Arrow instead of pickle as a file format, as NVIDIA has libraries that can load it directly onto the GPU, and it's a language-independent format. It would be important to make sure that the metadata file always has a Sample_ID column.
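A rough sketch of what that GPU path might look like with RAPIDS cuDF (an assumption on my part about which NVIDIA library is meant; requires a CUDA-capable GPU and the cudf package; file name carried over from the snippet above):

```python
import pyarrow as pa
import cudf  # RAPIDS cuDF; requires a CUDA-capable GPU

# Read the Arrow IPC file written earlier and hand it to the GPU.
table = pa.ipc.open_file("myfile.arrow").read_all()
gdf = cudf.DataFrame.from_arrow(table)

# Guard the metadata contract mentioned above.
assert "Sample_ID" in gdf.columns, "metadata must include a Sample_ID column"
```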
Helpful to know. This format would be useless to me, and probably to a lot of vanilla Python users who've never seen it, but we can add the option to support deep learning. I've never seen an Apache Arrow file.
Going from CSVs to pickles was a design choice that broke compatibility with R users, but made everything 100X faster when dealing with large data sets that can't fit into memory. Loading a stack of CSVs takes minutes, compared to a few seconds with a Python 3 pickle.